Preface



This edited volume addresses problems in computer vision involving multiple 
images. The images can be taken by multiple cameras, in different spectral bands 
(multiband images), at different times (video sequences), and so on. Computer 
vision research has to deal with multi-image or multi-sensor situations in varying 
contexts such as, for instance, 

— image databases: representations of similar situations, objects, processes, and
related search strategies,

— 3D shape reconstruction: binocular, trinocular, and multiple-view stereo,
structured light methods, photometric stereo, shape from multiple shadows,
registration and integration of partial (or single view) 3D reconstructions,
and

— augmented reality: multi-node panoramic scenes, omniviewing by special
cameras, video-to-(still) wide-angle image generation, incremental surface
visualization, or more advanced visualization techniques.

Recently multi-image techniques have become a main issue in image technology. 

The volume presents extended and updated versions of 20 talks given at the
10th International Workshop on Theoretical Foundations of Computer Vision
(March 12 - 17, 2000, Schloss Dagstuhl, Germany). Chapters are grouped into
four parts as follows: (i) 3D Data Acquisition and Sensor Design; (ii) Multi-Image
Analysis; (iii) Data Fusion in 3D Scene Description; and (iv) Applied 3D
Vision and Virtual Reality. They cover various theoretical, algorithmic, and
implementational issues in multi-image acquisition, storage, retrieval, processing,
analysis, manipulation, and visualization.

Abstract. Spherical cameras are variable-resolution imaging systems
and promising devices for autonomous navigation purposes, mainly be- 
cause of their wide viewing angle which increases the capabilities of 
vision-based obstacle avoidance schemes. In addition, spherical lenses 
resemble the primate eye in their projective models and are biologically 
relevant. However, the calibration of spherical lenses for Computer Vi- 
sion is a recent research topic and current procedures for pinhole camera 
calibration are inadequate when applied to spherical lenses. We present 
a novel method for spherical-lens camera calibration which models the 
lens radial and tangential distortions and determines the optical center 
and the angular deviations of the CCD sensor array within a unified 
numerical procedure. Contrary to other methods, there is no need for 
special equipment such as low-power laser beams or non-standard nu- 
merical procedures for finding the optical center. Numerical experiments, 
convergence and robustness analyses are presented. 



1 Introduction 

Spherical cameras are variable-resolution imaging systems useful for autonomous
navigation purposes, mainly because of their wide viewing angle, which increases
the capabilities of vision-based obstacle avoidance schemes. In addition,
spherical lenses resemble the primate eye in their projective models and are
biologically relevant. In spite of this, the calibration of spherical lenses is not
well understood and contributions to this topic have only recently begun to
appear in the literature.

Current standard procedures for pinhole camera calibration are inadequate
for spherical lenses, as such devices introduce significant amounts of image
distortion. Calibration methods such as Tsai's consider only the first term of
radial distortion, which is insufficient to account for the distortion typically
induced by spherical lenses. Other calibration procedures for high-distortion and
spherical lenses, such as Shah and Aggarwal's and Basu and Licardie's, have
been defined. However, these methods use special equipment such as low-power
laser beams or ad-hoc numerical procedures for determining the optical center
of spherical lenses. We propose a novel method which only requires an adequate
calibration plane and a unified numerical procedure for determining the optical
center, among other intrinsic parameters.



1.1 Types of Distortion 

The calibration of optical sensors in computer vision is an important issue in
autonomous navigation, stereo vision and numerous other applications where
accurate positional observations are required. Various techniques have been
developed for the calibration of sensors based on the traditional pinhole camera
model. Typically, the following types of geometrical distortion have been
recognized and dealt with:

— Radial Distortion: This type of distortion is point-symmetric at the optical
center of the lens and causes an inward or outward shift of image points from
their initial perspective projection. About the optical center, radial distortion
is expressed as

$$\hat{r} = r + \kappa_1 r^3 + \kappa_2 r^5 + \kappa_3 r^7 + \cdots, \qquad (1)$$

where $\kappa_i$ are radial distortion coefficients, $r$ is the observed radial component
of a projected point and $\hat{r}$ its predicted perspective projection.

— Decentering Distortion: The misalignment of the optical centers of various
lens elements in the sensor induces a decentering distortion which has
both a radial and a tangential component. They are expressed as

$$\hat{r} = r + 3(\eta_1 r^2 + \eta_2 r^4 + \eta_3 r^6 + \cdots)\sin(\theta - \theta_0)$$

$$\hat{\theta} = \theta + (\eta_1 r^2 + \eta_2 r^4 + \eta_3 r^6 + \cdots)\cos(\theta - \theta_0), \qquad (2)$$

where $\eta_i$ are the decentering distortion coefficients, $\theta$ is the observed angular
component of a projected point, $\hat{\theta}$ is its predicted perspective projection
and $\theta_0$ is the angle between the positive $y$-axis and the axis of maximum
tangential distortion due to decentering.

— Thin Prism: Manufacturing imperfections of lens elements and misalignment
of CCD sensor arrays from their ideal, perpendicular orientation to the
optical axis introduce additional radial and tangential distortions which are
given by

$$\hat{r} = r + (\zeta_1 r^2 + \zeta_2 r^4 + \zeta_3 r^6 + \cdots)\sin(\theta - \theta_1)$$

$$\hat{\theta} = \theta + (\zeta_1 r^2 + \zeta_2 r^4 + \zeta_3 r^6 + \cdots)\cos(\theta - \theta_1), \qquad (3)$$

where $\zeta_i$ are the thin prism distortion coefficients and $\theta_1$ is the angle between
the positive $y$-axis and the axis of maximum tangential distortion due to thin
prism.
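The three distortion models listed above can be sketched numerically as follows. This is a minimal illustration, not code from the chapter: the function names are hypothetical, and the exponents (odd powers starting at 3 for the radial series, even powers starting at 2 for the decentering and thin prism series) are assumed from the standard form of these models, since the printed equations are partly illegible.

```python
import numpy as np

def radial_distortion(r, kappa):
    """Predicted radius r + kappa_1 r^3 + kappa_2 r^5 + ... (assumed form of Eq. 1)."""
    r = np.asarray(r, dtype=float)
    return r + sum(c * r ** (2 * k + 3) for k, c in enumerate(kappa))

def even_series(r, coeffs):
    """Series c_1 r^2 + c_2 r^4 + ... shared by the decentering and thin prism models."""
    r = np.asarray(r, dtype=float)
    return sum(c * r ** (2 * k + 2) for k, c in enumerate(coeffs))

def decentering_distortion(r, theta, eta, theta0):
    """Predicted (r, theta) under decentering distortion (assumed form of Eq. 2)."""
    s = even_series(r, eta)
    return r + 3.0 * s * np.sin(theta - theta0), theta + s * np.cos(theta - theta0)

def thin_prism_distortion(r, theta, zeta, theta1):
    """Predicted (r, theta) under thin prism distortion (assumed form of Eq. 3)."""
    s = even_series(r, zeta)
    return r + s * np.sin(theta - theta1), theta + s * np.cos(theta - theta1)

# Illustrative values only: one coefficient per model.
r_hat_radial = radial_distortion(0.5, [0.1])
r_hat_dec, theta_hat_dec = decentering_distortion(0.5, 0.3, [0.2], theta0=0.3)
```

Along the axis of maximum tangential distortion (here `theta == theta0`), the sine term vanishes, so the decentering model leaves the radius unchanged and shifts only the angle.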
1.2 Related Literature

The need for foveated visual fields in active vision applications has motivated the
design of special-purpose spherical lenses and catadioptric sensors. These
imaging systems introduce significant amounts of radial and possibly tangential
distortions (see Figure 2) and traditional methods that only calibrate for the
perspective projection matrix and neglect to compensate for these distortions
are inadequate.

The calibration methods designed for high-distortion lenses typically model
the radial and tangential distortion components with polynomial curve-fitting.
Examples of such methods are Shah and Aggarwal's and Basu and Licardie's.
Both of these methods calibrate the optical center by using procedures that
are not elegantly integrated into the curve-fitting procedure which recovers
distortion coefficients. For instance, Basu and Licardie's method consists of a
minimization of vertical and horizontal calibration-line curvatures whereas Shah and
Aggarwal's requires the use of a low-power laser beam based on a partial
reflection beam-alignment technique.

Other, similar methods perform minimizations of functionals representing
measures of the accuracy of the image transformation with respect to
calibration parameters. These methods rely on the point-symmetry of radial
distortion at the location of the optical center on the image plane to reduce
the dimensionality of the parameter space or to iteratively refine calibration
parameters initially obtained with a distortion-free pinhole camera model.

In addition to these calibration techniques, Miyamoto defined mappings
relating the world plane angle $\theta_1$ to the image plane angle $\theta_2$. One such mapping
is given by $\theta_2 = \tan\theta_1$ (see Figure 1). Alternatively, Anderson et al. defined a
similar mapping, this time based on Snell's law of refraction. Unfortunately, the
accuracy of these models is limited to the neighborhood of the optical center.
Basu and Licardie also proposed alternative models for fish-eye lenses based
on log-polar transformations but, in this case, they demonstrate that the
small number of calibration parameters does not permit accurate modelling of a
spherical lens.



2 Standard Procedure for Fish-Eye Lens Calibration 



The number of free intrinsic parameters for a typical high-distortion lens is large,
especially when one considers sources of radial distortion, decentering and thin
prism, manufacturing misalignments such as tilt, yaw and roll angles of the CCD
sensor array with respect to its ideal position, image center versus optical center,
etc. We encompass radial and tangential distortions in two polynomials for which
the coefficients are to be determined with respect to the sources of distortion
emanating from the location of the optical center and the pitch and yaw angles
of the CCD sensor. We proceed by describing the least-squares method chosen
to perform the polynomial fits for both radial and tangential distortions.

Fig. 1. The image plane and world plane angles $\theta_1$ and $\theta_2$ are the angles formed by
the projective rays between the image plane and the world plane, both orthogonal to
the optical axis.



2.1 Radial and Tangential Polynomials 

Given a set of calibration points and their image locations, the equations
describing the transformation from fish-eye to pinhole are

$$\hat{\theta}_{ij} = \sum_{k=0}^{L} a_k \theta_{ij}^k \quad\text{and}\quad \hat{r}_{ij} = \sum_{k=0}^{L} b_k r_{ij}^k, \qquad (4)$$

where $L$ is the order of the polynomials and $\hat{\theta}_{ij}$ and $\hat{r}_{ij}$ are the corrected polar
coordinates of the calibration points. We use a calibration pattern for which
the points align into horizontal, diagonal and vertical lines. These calibration
points may be arranged in matrix form consistent with their geometric location
on the calibration plane:



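$$\begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & \vdots & & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{pmatrix} \quad \begin{pmatrix} \hat{p}_{11} & \hat{p}_{12} & \cdots & \hat{p}_{1n} \\ \hat{p}_{21} & \hat{p}_{22} & \cdots & \hat{p}_{2n} \\ \vdots & \vdots & & \vdots \\ \hat{p}_{n1} & \hat{p}_{n2} & \cdots & \hat{p}_{nn} \end{pmatrix} \quad \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} \end{pmatrix} \qquad (5)$$

where $P_{ij} = (X_{ij}, Y_{ij}, Z_{ij})$ are the 3D calibration points expressed in the
coordinate system of the camera, $\hat{p}_{ij} = (\hat{r}_{ij}, \hat{\theta}_{ij})$ are the 2D projections of $P_{ij}$ onto
the pinhole camera and $p_{ij} = (r_{ij}, \theta_{ij})$ are the projections of $P_{ij}$ as imaged by the
spherical lens.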

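Various minimization methods may be applied to the polynomials in order
to determine their coefficients. For instance, Lagrangian minimization and
least-squares have been used. For our purposes, we adopt a least-squares approach to
find the polynomial coefficients and perform the correction. This least-squares
fit for the radial and tangential distortion polynomials can be expressed as

$$\chi_\theta = \sum_{i=1}^{n}\sum_{j=1}^{n}\left(\hat{\theta}_{ij} - \sum_{k=0}^{L} a_k \theta_{ij}^k\right)^2 \quad\text{and}\quad \chi_r = \sum_{i=1}^{n}\sum_{j=1}^{n}\left(\hat{r}_{ij} - \sum_{k=0}^{L} b_k r_{ij}^k\right)^2. \qquad (6)$$

Differentiating these expressions with respect to the coefficients and setting the
derivatives to zero yields the following systems of linear equations, understood as
summed over all calibration points:

$$\mathbf{a}^T \boldsymbol{\theta}_{ij}\,\boldsymbol{\theta}_{ij} = \hat{\theta}_{ij}\,\boldsymbol{\theta}_{ij} \quad\text{and}\quad \mathbf{b}^T \mathbf{r}_{ij}\,\mathbf{r}_{ij} = \hat{r}_{ij}\,\mathbf{r}_{ij}, \qquad (7)$$

where

$$\mathbf{a} = (a_0, \ldots, a_L)^T, \quad \mathbf{b} = (b_0, \ldots, b_L)^T, \quad \mathbf{r}_{ij} = (r_{ij}^0, \ldots, r_{ij}^L)^T, \quad \boldsymbol{\theta}_{ij} = (\theta_{ij}^0, \ldots, \theta_{ij}^L)^T.$$

Fig. 2. Radial and tangential distortions. The original point, expressed as $(r, \theta)$, is the
expected observation. The distorted point, as observed, is expressed as $(r + \delta r, \theta + \delta\theta)$,
where $\delta r$ and $\delta\theta$ are the radial and tangential distortions, respectively.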
We write the general least-squares system of equations as

$$\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=0}^{L} a_k\,\theta_{ij}^{k}\,\theta_{ij}^{m} = \sum_{i=1}^{n}\sum_{j=1}^{n} \hat{\theta}_{ij}\,\theta_{ij}^{m}, \qquad m = 0, 1, \ldots, L,$$

and

$$\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=0}^{L} b_k\,r_{ij}^{k}\,r_{ij}^{m} = \sum_{i=1}^{n}\sum_{j=1}^{n} \hat{r}_{ij}\,r_{ij}^{m}, \qquad m = 0, 1, \ldots, L. \qquad (8)$$

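The least-squares matrices may be written as

$$A_r = \begin{pmatrix} r_{11}^0 & \cdots & r_{11}^L \\ r_{12}^0 & \cdots & r_{12}^L \\ \vdots & & \vdots \\ r_{1n}^0 & \cdots & r_{1n}^L \\ r_{21}^0 & \cdots & r_{21}^L \\ \vdots & & \vdots \\ r_{nn}^0 & \cdots & r_{nn}^L \end{pmatrix} \qquad (9)$$

$$A_\theta = \begin{pmatrix} \theta_{11}^0 & \cdots & \theta_{11}^L \\ \theta_{12}^0 & \cdots & \theta_{12}^L \\ \vdots & & \vdots \\ \theta_{1n}^0 & \cdots & \theta_{1n}^L \\ \theta_{21}^0 & \cdots & \theta_{21}^L \\ \vdots & & \vdots \\ \theta_{nn}^0 & \cdots & \theta_{nn}^L \end{pmatrix} \qquad (10)$$

and we form the least-squares systems of equations as $R_\theta \mathbf{a} = \boldsymbol{\theta}$ and $R_r \mathbf{b} = \mathbf{r}$,
where $R_\theta = A_\theta^T A_\theta$, $R_r = A_r^T A_r$, $\mathbf{r} = A_r^T \mathbf{c}_r$, $\boldsymbol{\theta} = A_\theta^T \mathbf{c}_\theta$ and

$$\mathbf{c}_\theta = (\hat{\theta}_{11}, \hat{\theta}_{12}, \ldots, \hat{\theta}_{nn})^T, \qquad \mathbf{c}_r = (\hat{r}_{11}, \hat{r}_{12}, \ldots, \hat{r}_{nn})^T.$$

The coefficients $\mathbf{a}$ and $\mathbf{b}$ are those that minimize $\chi_\theta = |A_\theta \mathbf{a} - \mathbf{c}_\theta|^2$
and $\chi_r = |A_r \mathbf{b} - \mathbf{c}_r|^2$. We use Singular Value Decomposition (SVD) to perform
the least-squares fits

$$\mathbf{a} = V_\theta\,\mathrm{diag}(W_\theta)^{-1}\,(U_\theta^T \mathbf{c}_\theta) \qquad (11)$$

$$\mathbf{b} = V_r\,\mathrm{diag}(W_r)^{-1}\,(U_r^T \mathbf{c}_r) \qquad (12)$$

where $A_\theta = U_\theta W_\theta V_\theta^T$ and $A_r = U_r W_r V_r^T$, and to compute $\chi_\theta$ and $\chi_r$. We
use the notation $\mathbf{a}(\mathbf{x}_c, \mathbf{x}_p)$, $\mathbf{b}(\mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v)$, $\chi_\theta(\mathbf{x}_c, \mathbf{x}_p)$ and $\chi_r(\mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v)$ to
indicate that the least-squares solution for the tangential distortion coefficients $\mathbf{a}$
and the residual $\chi_\theta$ depend on $\mathbf{x}_c$, the location of the optical center with respect
to the coordinate system in which the fit is performed, and $\mathbf{x}_p$, the translation
parallel to the calibration surface, and that the radial distortion coefficients $\mathbf{b}$
and the residual $\chi_r$ depend on the optical center $\mathbf{x}_c$, the camera translation $\mathbf{x}_p$
and $\theta_u$ and $\theta_v$, the pitch and yaw angles of the CCD sensor array with respect to
a plane perpendicular to the optical axis. We further explain and experimentally
demonstrate these dependencies in Sections 2.4 and 2.5.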
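The least-squares fit described in this section can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: `design_matrix` and `svd_least_squares` are hypothetical names, the data are synthetic, and the solution is written as V diag(1/w) U^T c, which is the usual SVD form of the least-squares solution.

```python
import numpy as np

def design_matrix(x, order):
    """One row (x^0, x^1, ..., x^L) per calibration point, as in the matrices A_r and A_theta."""
    x = np.asarray(x, dtype=float).ravel()
    return np.vander(x, N=order + 1, increasing=True)

def svd_least_squares(A, c):
    """Minimise |A a - c|^2 using A = U diag(w) V^T and a = V diag(1/w) U^T c."""
    U, w, Vt = np.linalg.svd(A, full_matrices=False)
    a = Vt.T @ ((U.T @ c) / w)
    residual = float(np.sum((A @ a - c) ** 2))
    return a, residual

# Synthetic data (an assumption for illustration): observed radii r and
# "pinhole" radii r_hat related by a known cubic polynomial.
r = np.linspace(0.05, 1.0, 40)
r_hat = 0.3 * r + 1.2 * r ** 3

A_r = design_matrix(r, order=3)
b, chi_r = svd_least_squares(A_r, r_hat)
```

In this noise-free example the recovered coefficients `b` match the generating polynomial and the residual `chi_r` is zero up to rounding. The same two functions apply unchanged to the angular data.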

2.2 Polynomial Order 

Overfitting of the data, which occurs when the polynomial order exceeds the intrinsic
order of the data, constitutes our primary motivation for using SVD in the least-squares
solutions of the polynomial coefficients. For instance, if any of the singular values
is less than a tolerance level of $10^{-6}$, we set its reciprocal to zero, rather than
letting it go to some arbitrarily high value. We thus avoid overfits of the calibration
data when solving for $\mathbf{a}(\mathbf{x}_c, \mathbf{x}_p)$ and $\mathbf{b}(\mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v)$ in (11) and (12). Because
of this capability and considering that the computational cost of calibration is
usually not critical when compared with real-time vision computations, we use
polynomials of order L = 12.
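The treatment of small singular values can be sketched as follows. The function name `truncated_svd_solve` is hypothetical, the tolerance value is an assumption, and the rank-deficient example is made up for illustration.

```python
import numpy as np

def truncated_svd_solve(A, c, tol=1e-6):
    """Least-squares solve in which reciprocals of singular values below tol are set to zero."""
    U, w, Vt = np.linalg.svd(A, full_matrices=False)
    w_inv = np.zeros_like(w)
    keep = w > tol
    w_inv[keep] = 1.0 / w[keep]          # reciprocals of negligible singular values stay at zero
    coeffs = Vt.T @ (w_inv * (U.T @ c))
    return coeffs, int(np.count_nonzero(keep))

# Rank-deficient example: the third column duplicates the second, so one
# singular value is numerically zero and an unguarded 1/w would blow up.
x = np.linspace(-1.0, 1.0, 21)
A = np.column_stack([np.ones_like(x), x, x])
c = 2.0 + 3.0 * x

coeffs, rank = truncated_svd_solve(A, c)
```

Here two singular values are retained and the solution is the minimum-norm one, which splits the slope equally between the two duplicated columns instead of producing arbitrarily large, cancelling coefficients.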

2.4 The Optical Center

The optical center of a lens is defined as the point where the optical axis passing
through the lens intersects the image plane of the camera. Alternatively, the
optical center is the image point where no distortions appear, radial or tangential.
That is to say, where $\hat{r}_{ij} = r_{ij}$ and $\hat{\theta}_{ij} = \theta_{ij}$. In addition, radial distortion
is point-symmetric at the optical center and, consequently, the one-dimensional
polynomial in $r$ is accurate only when aligned with the optical center. Figure
3 shows plots of $(\hat{r}_{ij}, r_{ij})$ and $(\hat{\theta}_{ij}, \theta_{ij})$ at and away from the optical center,
in which the point-scattering effect becomes apparent as the polynomial fit is
gradually decentered from the optical center. This effect is reflected in the values
of the $\chi_r(\mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v)$ and $\chi_\theta(\mathbf{x}_c, \mathbf{x}_p)$ functions around the optical center, as
illustrated by Figure 4.

Fig. 3. Plots of $(\hat{r}_{ij}, r_{ij})$ and $(\hat{\theta}_{ij}, \theta_{ij})$. a) (top, from left to right): $\hat{r}_{ij}$ and $r_{ij}$ at the
optical center, (2.5, 2.5) and (5.0, 5.0) image units away from it. b) (bottom, from
left to right): $\hat{\theta}_{ij}$ and $\theta_{ij}$ at the optical center, (25.0, 25.0) and (50.0, 50.0) image units
away from it. The increasing scattering of the plots as the distance from the optical
center increases prevents accurate modelling of the lens. The effect is most apparent
for the $r_{ij}$'s, yet it is also observed with the $\theta_{ij}$'s.

Fig. 4. Effect of translation from the optical center on $\chi_r(\mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v)$ and $\chi_\theta(\mathbf{x}_c, \mathbf{x}_p)$.
a) (left): Plot of the $\chi_r(\mathbf{x} - \mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v)$ function. b) (center): Plot of the
$\chi_\theta(\mathbf{x} - \mathbf{x}_c, \mathbf{x}_p)$ function. c) (right): Plot of the $\chi_r(\mathbf{x} - \mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v) + \chi_\theta(\mathbf{x} - \mathbf{x}_c, \mathbf{x}_p)$ function.
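The point-scattering effect described in this section can be reproduced with a small simulation. This is a sketch under assumptions, not the experiment behind Figures 3 and 4: the grid, the logarithmic distortion model with parameters `s` and `lam`, the polynomial order and the offset are all illustrative choices, and `radial_fit_residual` is a hypothetical name.

```python
import numpy as np

def radial_fit_residual(offset, order=4, s=1.0, lam=2.0, n=10):
    """Residual of the radial polynomial fit when radii are measured about a displaced center."""
    # Pinhole points on an n x n grid (n even, so that no point lies at the origin).
    g = np.linspace(-1.0, 1.0, n)
    xh, yh = np.meshgrid(g, g)
    r_hat = np.hypot(xh, yh).ravel()
    xi = np.arctan2(yh, xh).ravel()
    # Simulated spherical points, distorted about the true optical center (0, 0).
    r_true = s * np.log1p(lam * r_hat)
    x, y = r_true * np.cos(xi), r_true * np.sin(xi)
    # Radii measured about an assumed center displaced by `offset`.
    r = np.hypot(x - offset[0], y - offset[1])
    A = np.vander(r, N=order + 1, increasing=True)
    b, *_ = np.linalg.lstsq(A, r_hat, rcond=None)
    return float(np.sum((A @ b - r_hat) ** 2))

chi_centered = radial_fit_residual((0.0, 0.0))
chi_decentered = radial_fit_residual((0.2, 0.2))
```

When the fit is centered, the pinhole radius is a smooth function of the observed radius and the residual is small. Once the assumed center is displaced, points with the same pinhole radius are observed at different radii, so no one-dimensional polynomial can fit them and the residual grows.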



2.5 CCD Sensor Array Misalignments 



CCD sensor misalignments are due to imperfections at the time of assembly.
These imperfections, however minute, introduce additional noise as some types
of misalignments influence the value of the $\chi_r(\mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v)$ function. We have
studied the effect of such misalignments by rotating the image plane of the
synthetic camera model about its origin. Figure 5 shows the $\chi_r(\mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v)$
and $\chi_\theta(\mathbf{x}_c, \mathbf{x}_p)$ functions for rotations $\theta_u$, $\theta_v$ and $\theta_n$ about the u, v and n axes of
the synthetic camera. The effects have been studied in isolation from one another
and, in these experiments, the optical center projected onto the origin of the
synthetic camera.

As expected, rotations about the line of sight axis n have no effect on the
$\chi_r(\mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v)$ function, as they do not break the point-symmetry of radial
distortion. However, rotations about the axes of the image plane u and v
introduce errors reflected in $\chi_r(\mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v)$ (see Figure 5a). As expected, this type
of rotation breaks the point-symmetry of radial distortion.

In all three types of rotations, the $\chi_\theta(\mathbf{x}_c, \mathbf{x}_p)$ function remains undisturbed,
as shown in Figure 5b. Since the position of the optical center is not shifted by
the rotations, no violation of the line-symmetry of the tangential distortion is
introduced. If such rotations were to be centered away from the image position
of the optical center, then errors would be introduced because of the breaking
of the line-symmetry. This is also illustrated by Figure 6 where, for the three
types of rotation, the plots of $(\hat{\theta}_{ij}, \theta_{ij})$ describe a bijection and do not introduce
approximation errors in the fit, contrary to the plots of $(\hat{r}_{ij}, r_{ij})$ in Figure 3.






Fig. 5. Effect of CCD array rotation on the $\chi_r(\mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v)$ and $\chi_\theta(\mathbf{x}_c, \mathbf{x}_p)$ functions. a)
(top, from left to right): The $\chi_r(\mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v)$ residual function against rotations
around the u, v and n axes. b) (bottom, from left to right): The $\chi_\theta(\mathbf{x}_c, \mathbf{x}_p)$ residual
function against rotations around the u, v and n axes.



Another phenomenon affecting the value of the residual is the alignment of
the synthetic pinhole calibration dots with the spherical points as imaged by the
lens. Given an ideal situation in which the central calibration point is imaged
at the image center and this location coincides with the optical center,
the residual is at a minimum. However, any deviation from this situation
substantially increases the value of the residual, and this increase is by no means
related to the calibration parameters of the camera. Additionally, we cannot
require that the central calibration dot be imaged at the optical center, since it
is one of the parameters to be estimated.

In light of this, we also model translation of the camera parallel to the
calibration plane as translation of the synthetic pinhole calibration points $\hat{p}_{ij}$.
Consequently, the calibration method must minimize the residual with respect to
the following parameters:

— $\mathbf{x}_c$: The amount of translation of imaged spherical points $p_{ij}$, which models
translation of the CCD sensor array in the (u, v) plane. In other words, $\mathbf{x}_c$
is the translation from the image center to the optical center.

— $\mathbf{x}_p$: The amount of translation of the synthetic pinhole calibration points
$\hat{p}_{ij}$, which models the translation of the camera in the (X, Y) plane, parallel
to the calibration surface.

— $\theta_u$, $\theta_v$: The pitch and yaw angles of the CCD sensor array.




Fig. 6. Plots of $(\hat{\theta}_{ij}, \theta_{ij})$ under rotations of 0.8 radians around a) (left): the u axis,
b) (center): the v axis and c) (right): the n axis.



3 Synthetic Camera Model 

We calibrate against a standard, synthetic pinhole camera described by linear
transformation matrices containing the intrinsic parameters to be calibrated.
The first transformation is from the world coordinate system to that of the
synthetic camera, expressed by the camera position $\mathbf{r}$ in world coordinates and
orthogonal unit vectors $\mathbf{u} = (u_x, u_y, u_z)^T$, $\mathbf{v} = (v_x, v_y, v_z)^T$ and $\mathbf{n} = (n_x, n_y, n_z)^T$.
In addition, since the vector joining the image plane at the optical center and
the focal point may not be perpendicular to the image plane, we model the focal
length in the coordinate system of the camera as a vector $\mathbf{f} = (f_u, f_v, f_n)^T$. The
translation from optical center to image center $\mathbf{x}_c = (x_c, y_c)^T$ and the scaling
factors $s_x$ and $s_y$ from synthetic camera image to real image are also parameters
forming the synthetic camera model. Combining these into a homogeneous linear
transformation yields the matrix C:

$$C = \begin{pmatrix}
s_x u_x - f_n^{-1} n_x (f_u s_x - x_c) & s_y v_x - f_n^{-1} n_x (f_v s_y - y_c) & n_x & -f_n^{-1} n_x \\
s_x u_y - f_n^{-1} n_y (f_u s_x - x_c) & s_y v_y - f_n^{-1} n_y (f_v s_y - y_c) & n_y & -f_n^{-1} n_y \\
s_x u_z - f_n^{-1} n_z (f_u s_x - x_c) & s_y v_z - f_n^{-1} n_z (f_v s_y - y_c) & n_z & -f_n^{-1} n_z \\
s_x r'_x - f_n^{-1} r'_z (f_u s_x - x_c) + x_c & s_y r'_y - f_n^{-1} r'_z (f_v s_y - y_c) + y_c & r'_z & -f_n^{-1} r'_z + 1
\end{pmatrix}$$

where $r'_x = -\mathbf{r}^T\mathbf{u}$, $r'_y = -\mathbf{r}^T\mathbf{v}$ and $r'_z = -\mathbf{r}^T\mathbf{n}$. Planar points $P_{ij}$ are projected
onto the imaging plane of the pinhole camera as $C^T P_{ij} = \hat{p}_{ij}$. To obtain the
points $p_{ij}$ as imaged by a hypothetical spherical lens, we use the fish-eye
transform due to Basu and Licardie to distort the $\hat{p}_{ij}$'s. The fish-eye transformation
is given by

$$p_{ij} = s \log(1 + \lambda \|\hat{p}_{ij}\|_2)\,\tilde{p}_{ij} \qquad (13)$$

where $p_{ij} = (x_{ij}, y_{ij})^T$, $\hat{p}_{ij} = (\hat{x}_{ij}, \hat{y}_{ij})^T$, $\tilde{p}_{ij} = (\cos\xi, \sin\xi)^T$, and $\xi = \tan^{-1}(\hat{y}_{ij}/\hat{x}_{ij})$.
The symbols $s$ and $\lambda$ are scaling and radial distortion factors, respectively.
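The fish-eye transformation (13) can be sketched as follows. The function name `fisheye_transform` is hypothetical, and `arctan2` is used in place of a plain arctangent of the ratio so that all four quadrants and the origin are handled; this is an implementation choice, not something stated in the text.

```python
import numpy as np

def fisheye_transform(p_hat, s, lam):
    """Map pinhole points p_hat (N x 2) to spherical points with radius s*log(1 + lam*|p_hat|)."""
    p_hat = np.atleast_2d(np.asarray(p_hat, dtype=float))
    r_hat = np.linalg.norm(p_hat, axis=1)
    xi = np.arctan2(p_hat[:, 1], p_hat[:, 0])   # direction of each point
    r = s * np.log1p(lam * r_hat)               # compressed radius
    return np.column_stack([r * np.cos(xi), r * np.sin(xi)])

# Illustrative values for s and lambda.
p = fisheye_transform([[3.0, 4.0], [0.0, 0.0]], s=2.0, lam=0.5)
```

The transform preserves the direction of each point and compresses its distance from the center logarithmically, so the point at the center is left in place.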

4 Description of Algorithm 

As a first step, we generate calibration points using the synthetic pinhole camera.
The analytic calibration plane is conveniently located in the (X, Y) plane of the
world coordinate system and the line of sight of the pinhole camera coincides
with the Z axis.

The synthetic image plane is at 340 mm from the calibration plane and the
focal length is set to 100 mm. The pinhole calibration points are then projected
onto the image plane of the synthetic camera as $C^T P_{ij} = \hat{p}_{ij}$ and kept in polar
coordinates as $(\hat{r}_{ij}, \hat{\theta}_{ij})$.

Using the spherical camera, oriented perpendicularly to the real calibration
plane, a frame of the calibration points is grabbed. The lens of the spherical
camera is at 280 mm from the calibration plane. Figures 7b and 7c show such
frames. We perform point detection on this image by computing the centroids of
the calibration points and obtain spherical image points $(r_{ij}, \theta_{ij})$. Both sets of
points $(\hat{r}_{ij}, \hat{\theta}_{ij})$ and $(r_{ij}, \theta_{ij})$ are scaled to the canonical space $[(-1, -\frac{\pi}{2}), (1, \frac{\pi}{2})]$
where the minimization procedure is to begin.

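We use the Polak-Ribière conjugate gradient minimization procedure,
which we apply to the function $\chi^2 = \chi_r(\mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v) + \chi_\theta(\mathbf{x}_c, \mathbf{x}_p)$. In order
to perform the minimization, the partial derivatives $\frac{\partial\chi^2}{\partial x_c}$, $\frac{\partial\chi^2}{\partial y_c}$, $\frac{\partial\chi^2}{\partial x_p}$, $\frac{\partial\chi^2}{\partial y_p}$, $\frac{\partial\chi^2}{\partial\theta_u}$ and
$\frac{\partial\chi^2}{\partial\theta_v}$ need to be evaluated for various values of $(\mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v)$.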
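A Polak-Ribière conjugate gradient iteration can be sketched as below. This is a generic illustration, not the chapter's procedure: the line search is a single secant step along the search direction (exact for quadratic functions), which is an assumption made to keep the example short, and the quadratic test function, its weights and its target are made up.

```python
import numpy as np

def polak_ribiere_cg(grad, x0, tol=1e-6, max_iter=100, eps=1e-6):
    """Minimise a smooth function given its gradient, using Polak-Ribiere directions."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g
    iterations = 0
    while np.linalg.norm(g) > tol and iterations < max_iter:
        # Secant estimate of the curvature d^T H d along the search direction.
        curvature = (grad(x + eps * d) - g) @ d / eps
        # One-dimensional minimiser along d; small fixed step if curvature is not positive.
        step = -(g @ d) / curvature if curvature > 0.0 else 1e-3
        x = x + step * d
        g_new = grad(x)
        # Polak-Ribiere coefficient, clipped at zero (restart with steepest descent).
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))
        d = -g_new + beta * d
        g = g_new
        iterations += 1
    return x, iterations

# Hypothetical 6D quadratic with a known minimum, started at the origin.
target = np.array([5.0, 5.0, 15.0, -5.0, -0.1, 0.2])
weights = np.arange(1.0, 7.0)

def toy_gradient(x):
    return 2.0 * weights * (x - target)

solution, n_iter = polak_ribiere_cg(toy_gradient, np.zeros(6))
```

For a function that is not quadratic, the single secant step would normally be replaced by a proper one-dimensional line minimisation.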

To evaluate the partial derivatives with respect to $\mathbf{x}_c$, we perform
translations of the detected spherical calibration points $p_{ij} = (r_{ij}, \theta_{ij})$ on the image
plane and perform least-squares fits to obtain the residual values then used for
computing 5-point central differences. Evaluation of partial derivatives with respect
to CCD array angles is more involved. The first step is to reproject the pinhole
calibration points $\hat{p}_{ij}$ back onto the calibration plane using $C^{-1}$, the inverse
of the pinhole camera transformation. Rotations of these reprojected points in
3D and reprojection onto the image plane of the pinhole camera provide the residual
values for computing 5-point central differences. The minimization is performed
with the shifted and rotated calibration points and is guided by the 6D gradient
vector $\left(\frac{\partial\chi^2}{\partial x_c}, \frac{\partial\chi^2}{\partial y_c}, \frac{\partial\chi^2}{\partial x_p}, \frac{\partial\chi^2}{\partial y_p}, \frac{\partial\chi^2}{\partial\theta_u}, \frac{\partial\chi^2}{\partial\theta_v}\right)$. The output of the algorithm is the optical
center $\mathbf{x}_c$, represented as the shift from the image center, the camera translation
$\mathbf{x}_p$ parallel to the calibration surface with respect to the central calibration point,
the CCD sensor array pitch and yaw angles $\theta_u$ and $\theta_v$ and the polynomials in $r$
and $\theta$ for image transformation from spherical to pinhole. In essence, the
procedure is to find the parameter values that best explain the detected calibration
points as imaged by the spherical lens.
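The 5-point central differences used for the gradient can be sketched as follows. The stencil is the standard fourth-order one. The function name and the quartic test function are illustrative assumptions. The step sizes of 0.2 image units for translations and 0.0002 radians for rotations are those quoted in Section 5.1.

```python
import numpy as np

def five_point_gradient(f, x, steps):
    """Gradient of f at x using the 5-point central difference along each coordinate."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i, h in enumerate(steps):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x - 2 * e) - 8 * f(x - e) + 8 * f(x + e) - f(x + 2 * e)) / (12 * h)
    return grad

# Step sizes: four translation components, then two rotation angles.
steps = [0.2, 0.2, 0.2, 0.2, 0.0002, 0.0002]

def toy_function(p):
    # Made-up smooth function standing in for the residual.
    return float(np.sum(p ** 4) + p[0] * p[1])

point = np.array([1.0, 2.0, 0.5, -1.0, 0.1, -0.2])
gradient = five_point_gradient(toy_function, point, steps)
```

Each gradient component costs four evaluations of the function, so one 6D gradient of the residual requires 24 least-squares fits in the setting described above.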



5 Numerical Results 

We study the convergence rate of the calibration procedure, its resistance to
input noise and the results obtained with the calibration images of Figures 7b and
7c, corresponding to spherical cameras A and B, respectively. Figure 7a shows a
typical frame taken by a spherical camera, while Figures 7b and 7c show frames of the
calibration plane grabbed with our spherical cameras A and B. The calibration
plane has a width and height of 8 feet and the 529 calibration dots are spaced by
4 inches both horizontally and vertically. In order to capture the calibration
images, the spherical cameras are mounted on a tripod and approximately aligned
with the central calibration dot. The spherical lenses are at a distance of 280 mm
from the calibration plane.
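The calibration pattern described above can be sketched as a grid of dot positions. The square 23 x 23 arrangement is an assumption inferred from 529 = 23 x 23 and from the n x n matrix arrangement of Section 2.1; the function name and the choice of millimetres centred on the middle dot are illustrative.

```python
import numpy as np

def calibration_grid(n_side=23, spacing_in=4.0):
    """Dot positions (in mm) of an n_side x n_side grid centred on its middle dot."""
    half = (n_side - 1) / 2
    coords = (np.arange(n_side) - half) * spacing_in * 25.4   # inches to millimetres
    X, Y = np.meshgrid(coords, coords)
    return np.stack([X, Y], axis=-1)   # shape (n_side, n_side, 2)

grid = calibration_grid()
extent_mm = grid[..., 0].max() - grid[..., 0].min()
```

With 4-inch spacing the outermost dots are 88 inches (2235.2 mm) apart, which fits within the 8-foot (2438.4 mm) width of the calibration plane.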

The convergence and noise resistance study is performed with a simulated
spherical lens. We use equation (13) in order to compute the spherical points
$p_{ij}$ from the synthetic pinhole calibration dots $\hat{p}_{ij}$. To model CCD sensor array
misalignments, we perform 3D rotations of the synthetic pinhole camera and
reproject the synthetic calibration points onto the rotated image plane prior to
using (13). In addition, we translate the spherical calibration points $p_{ij}$ to model
the distance of the optical center from the center of the image and also translate
the synthetic pinhole calibration points $\hat{p}_{ij}$ to model the camera translation
parallel to the calibration surface.

Input noise is introduced in each synthetic pinhole calibration dot $\hat{p}_{ij}$ as
Gaussian noise with standard deviations $\sigma_x$ and $\sigma_y$ expressed in image units
(pixels). This step is performed before using (13) and models only the positional
inaccuracy of calibration dots. We proceed to evaluate the performance of the
calibration procedure with respect to convergence rates and input noise levels
with a simulated spherical lens and present experiments on real spherical camera
images (our spherical cameras A and B) for which we have computed the
calibration parameters.



5.1 Convergence Analysis 

In order to study the convergence rate of the calibration method, we monitored
the values of the error function $\chi^2$ with respect to the number of iterations
performed in the 6D minimization procedure using the Polak-Ribière conjugate
gradient technique. Figure 8 reports three experiments performed with
various calibration parameters. The 6D search always begins at
$(\mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v) = 0$ and, as expected, the number of iterations required to
converge to the solution is proportional to the distance of the calibration parameters
from the initial search values. We used a tolerance of $1 \times 10^{-6}$ on convergence and
we computed the various derivatives of the error function $\chi^2$ with 5-point
differences with intervals of 0.2 image units for translation and intervals of 0.0002
radians for rotations.

Modelling and Removing Radial and Tangential Distortions

Fig. 7. a) (left): A typical image from a spherical lens camera. b) (center): Image
of the calibration plane grabbed with spherical camera A. c) (right): Image of the
calibration plane grabbed with spherical camera B.

As Figure 8 demonstrates, convergence rates are steep and, in general, 40
iterations are sufficient to obtain adequate calibration parameters. Figure 8a shows
the convergence for calibration parameters $(\mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v)$ = (5.0, 5.0, -0.1, 0.01);
Figure 8b shows the convergence for calibration parameters (15.0, -5.0, 0.0, 0.2)
and Figure 8c, for (15.0, -15.0, -0.1, 0.2).





Fig. 8. Convergence analysis of $\chi^2$ for various configurations of calibration parameters
$(\mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v)$. a) (left): (5.0, 5.0, -0.1, 0.01). b) (center): (15.0, -5.0, 0.0, 0.2). c)
(right): (15.0, -15.0, -0.1, 0.2).

5.2 Noise Robustness Analysis

In order to determine the robustness of the procedure with respect to input
noise, we introduced various levels of Gaussian noise into the synthetic pinhole
calibration dots. We used zero-mean Gaussian noise levels of $\|(\sigma_x, \sigma_y)\|_2$ =
0, 1.4142, 2.8284, 4.2426, 5.6569 and 7.0711, expressed in image units. The effects
of noise on the calibration parameters $\mathbf{x}_c$, $\mathbf{x}_p$, $\theta_u$ and $\theta_v$ and on the values of
the residual $\chi^2$ are depicted by the graphs of Figure 9, which show these values
for the noise levels we chose. As can be observed, the calibration parameters, whose
ground-truth values are $(\mathbf{x}_c, \mathbf{x}_p, \theta_u, \theta_v) = 0$, respond linearly to input noise whereas
the residual grows quadratically with respect to input noise.
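The quoted noise levels are the Euclidean norms of equal standard deviations in both image directions, from 0 to 5 image units, which the short sketch below reproduces. The helper `add_gaussian_noise` and the random seed are illustrative assumptions about how such noise could be injected, not the authors' code.

```python
import numpy as np

# Equal standard deviations in x and y, from 0 to 5 image units.
sigmas = np.arange(0.0, 6.0)
noise_levels = np.hypot(sigmas, sigmas)     # ||(sigma_x, sigma_y)||_2

def add_gaussian_noise(points, sigma_x, sigma_y, rng):
    """Add zero-mean Gaussian noise to an (N, 2) array of calibration dot positions."""
    points = np.asarray(points, dtype=float)
    noise = rng.normal(0.0, [sigma_x, sigma_y], size=points.shape)
    return points + noise

rng = np.random.default_rng(0)
dots = np.zeros((529, 2))
noisy_dots = add_gaussian_noise(dots, 1.0, 1.0, rng)
```

Rounded to four decimals, `noise_levels` gives 0, 1.4142, 2.8284, 4.2426, 5.6569 and 7.0711, matching the values listed above.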






Fig. 9. The effect of input zero-mean Gaussian noise on the calibration parameters
and the residual $\chi^2$. a) (left): The behavior of $\|\mathbf{x}_c\|_2$ with respect to input noise levels
$\|(\sigma_x, \sigma_y)\|_2$. b) (center): The behavior of $\|(\theta_u, \theta_v)\|_2$ and c) (right): the behavior of
the residual $\chi^2$.


5.3 Calibration of Spherical Images 

We have applied our calibration procedure to both of our spherical cameras and
determined their calibration parameters. Tables 1 and 2 show the parameters
obtained from spherical cameras A and B, respectively. Figures 10 and 11 show
the synthetic pinhole calibration points, the spherical points detected from
calibration images 7b and 7c, and the polynomial reconstruction of those detected
points with the calibration coefficients $a_i$ and $b_i$.

As Figure 10 demonstrates, our spherical camera A has a serious assembly
misalignment. The yaw angle is in excess of 0.16 radians. However, spherical
camera B does not show such misalignments and Figure 11 shows a quasi
fronto-parallel polynomial reconstruction of the detected spherical calibration points.
In the case of camera A, the misalignment of the CCD array is visible by careful
visual examination of the device.

5.4 Removing Distortion in Spherical Images 

The transformation polynomials $\hat{\theta}_{ij}$ and $\hat{r}_{ij}$ represent a mapping from spherical
to perspective image locations. However, to compensate for distortion, the

Table 1. The calibration parameters for spherical camera A.

Tangential Distortion Coefficients a_i
       a_1          a_2          a_3          a_4          a_5          a_6
   -0.0039       3.3274      -0.0216      -0.1836       0.0166      -1.3416
       a_7          a_8          a_9         a_10         a_11         a_12
   -0.1516       0.6853       0.2253      -0.3347      -0.0879      -0.0092

Radial Distortion Coefficients b_i
       b_1          b_2          b_3          b_4          b_5          b_6
  199.6790   -2634.8104   13799.4582  -26999.8134    8895.5168   23348.2599
       b_7          b_8          b_9         b_10         b_11         b_12
 4858.0468  -17647.3126  -24277.7749  -12166.4282   12108.0938   40070.6891

Singular Values w_i for chi_theta
       w_1          w_2          w_3          w_4          w_5          w_6
   23.0001      16.2318       9.6423       5.4287       2.6397       1.3043
       w_7          w_8          w_9         w_10         w_11         w_12
    0.5012       0.2400       0.0736       0.0325       0.0068       0.0028

Singular Values w_i for chi_r
       w_1          w_2          w_3          w_4          w_5          w_6
  525.1062      50.7337      22.1506       7.6035       2.3874       0.6154
       w_7          w_8          w_9         w_10         w_11         w_12
    0.1383       0.0260       0.0          0.0          0.0          0.0

Optical center, CCD angles and residual
       x_c          y_c      theta_u      theta_v        chi^2
   -0.0753      -3.2792      -0.0314      -0.1722       0.0543

Fig. 10. Calibration experiment with spherical camera A. a) (left): The pinhole 
calibration points, as imaged by the synthetic camera. b) (center): The spherical 
points as detected from the calibration image. c) (right): The polynomial 
reconstruction obtained for this set of calibration points. 

Table 2. The calibration parameters for spherical camera B. 

Calibration Parameters for Spherical Camera B 

Tangential Distortion Coefficients a_i 
          a_1          a_2          a_3          a_4          a_5          a_6
      -0.0097       3.1918      -0.0053      -0.2562       0.0658       0.1847
          a_7          a_8          a_9         a_10         a_11         a_12
      -0.1615       0.4940       0.1577      -0.8093      -0.0553       0.3371

Radial Distortion Coefficients b_i 
          b_1          b_2          b_3          b_4          b_5          b_6
     -30.4219     458.6032   -1240.1970    1394.3862    1003.5856    -610.6167
          b_7          b_8          b_9         b_10         b_11         b_12
   -1433.4416   -1063.6945      54.0374    1359.5348    2472.7284    3225.6347

Singular Values omega_i for X_theta 
      omega_1      omega_2      omega_3      omega_4      omega_5      omega_6
      23.6078      17.0001       9.9003       5.6505       2.7189       1.3567
      omega_7      omega_8      omega_9     omega_10     omega_11     omega_12
       0.5264       0.2489       0.0770       0.0336       0.0071       0.0030

Singular Values omega_i for X_r 
      omega_1      omega_2      omega_3      omega_4      omega_5      omega_6
      29.7794      10.8641       3.6978       1.0619       0.2580       0.0536
      omega_7      omega_8      omega_9     omega_10     omega_11     omega_12
       0.0095       0.0014          0.0          0.0          0.0          0.0

          x_c          y_c      theta_u      theta_v  (illegible)
       0.0118      -0.8273       0.0091       0.0031       0.1188

Fig. 11. Calibration experiment with spherical camera B. a) (left): The pinhole 
calibration points, as imaged by the synthetic camera. b) (center): The spherical 
points as detected from the calibration image. c) (right): The polynomial 
reconstruction obtained for this set of calibration points. 

inverse transformation is required and, in general, the inverse of a polynomial 
function cannot be found analytically. In light of this, we use the calibration 
parameters obtained during the modelling phase to: 

1. shift and rotate the planar image points by $x_p$, $\theta_u$ and $\theta_v$, 
respectively; 

2. shift the detected spherical points by $-x_c$; 

and compute the polynomial coefficients of the inverse transformation as 

L L 

% = au9'ij and ^ (14) 

using a procedure identical to the one used to solve for the calibration 
coefficients. The polynomials in (14) are the pseudo-inverses of the calibration 
polynomials and are used to remove radial and tangential distortions 
from spherical images. Figures 12 and 13 show the distortion removal on the 
calibration images and on a typical stereo pair acquired with the spherical cameras. 
A lookup table without interpolation (linear or other) was used to implement 
the transformation. 
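
The lookup-table step can be illustrated with a minimal sketch. It assumes the corrected and source images have the same size, and it uses a hypothetical callable `to_source` in place of the calibrated inverse polynomials; neither the function names nor the toy mapping come from the original work.

```python
import numpy as np

def build_lookup_table(shape, to_source):
    """Precompute, for every pixel of the corrected image, which source pixel to copy.

    `to_source(rows, cols)` is a hypothetical callable returning the (row, col)
    position in the spherical image for each corrected pixel; it stands in for
    the calibrated inverse polynomials and is an assumption of this sketch.
    """
    height, width = shape
    rows, cols = np.mgrid[0:height, 0:width].astype(float)
    src_r, src_c = to_source(rows, cols)
    # No interpolation: snap each position to the nearest source pixel.
    src_r = np.rint(src_r).astype(int)
    src_c = np.rint(src_c).astype(int)
    valid = (src_r >= 0) & (src_r < height) & (src_c >= 0) & (src_c < width)
    return np.clip(src_r, 0, height - 1), np.clip(src_c, 0, width - 1), valid

def apply_lookup_table(image, table, fill=0):
    """Resample `image` through a table built by build_lookup_table."""
    src_r, src_c, valid = table
    out = image[src_r, src_c]   # fancy indexing returns a new array
    out[~valid] = fill          # pixels that map outside the source image
    return out

# Toy mapping (an assumption): every corrected pixel reads from 1.2 columns to its right.
image = np.arange(12).reshape(3, 4)
table = build_lookup_table(image.shape, lambda r, c: (r, c + 1.2))
corrected = apply_lookup_table(image, table)
```

Because the table is computed once, each frame costs only one indexed copy; the price of skipping interpolation is a position error of up to half a pixel.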

5.5 Image Processing Issues 

Removing distortions from spherical images is not as important as the trans- 
formation of image processing results into a perspective space. The advantages 
of such approaches are many. For instance, the costly transformation of com- 
plete image sequences is avoided; image processing algorithms directly applied 
to spherical images do not suffer from the noise introduced with the distortion 
removal process, and the results of image processing algorithms are generally 
more compact with respect to the original signal and hence faster to transform 
to a perspective space. 

6 Conclusion 

Spherical cameras are variable-resolution imaging systems that have been recog- 
nized as promising devices for autonomous navigation purposes, mainly because 
of their wide viewing angle which increases the capabilities of vision-based ob- 
stacle avoidance schemes. In addition, spherical lenses resemble the primate eye 
in their projective models and are biologically relevant. We presented a novel 
method for spherical-lens camera calibration which models the lens radial and 
tangential distortions and determines the optical center and the angular devia- 
tions of the CCD sensor array within a unified numerical procedure. Contrary 
to other methods, there is no need for special equipment such as low-power laser 
beams or non-standard numerical procedures for finding the optical center. Nu- 
merical experiments and robustness analyses are presented and the results have 
shown adequate convergence rates and resistance to input noise. The method was 
successfully applied to our pair of spherical cameras and allowed us to diagnose 
a severe CCD array misalignment of camera A. 






S.S. Beauchemin and R. Bajcsy 




Fig. 12. Distortion removal from calibration images. (left): Camera A. (right): 
Camera B. 



A Point Detection Algorithm 

We use a calibration plane with a grid of $n \times n$ points (where $n$ is odd) for the 
calibration process. Using a spherical camera perpendicular to the calibration 
plane, frames of the calibration points are acquired. In this section we describe 
the algorithm used to detect the calibration points on this spherical image. 

The grid points are numbered according to their position in the image plane 
coordinate system. The central point is $p_{00}$, the points on the x-axis are defined 
from left to right by $\{p_{i0}\}$, where $-m \le i \le m$ and $m = (n-1)/2$, and the points 
on the y-axis from bottom to top by $\{p_{0j}\}$, $-m \le j \le m$. $p_{ij}$ is the point that 
lies in the $j$th row and the $i$th column of the grid, relative to the origin. The value 
of $p_{ij}$ is a 2D vector giving its centroid position, or fail for a point that was not 
detected. 

An iterative algorithm is used to detect the grid points. In the first iteration 
($k = 0$) the point at the center of the grid, $p_{00}$, is detected. In the $k$th iteration, 
$1 \le k \le 2m$, all the points $p_{ij}$ such that $|i| + |j| = k$ are found. The first step in 
detecting any grid point is defining an image pixel from which the search for this 
point is to begin. The initial pixel is used as an input to the detect procedure, 
which outputs the centroid of the requested grid point, or fail if the point is not 
found. 

Fig. 13. Distortion removal from typical images. (left): Camera A. (right): Camera B. 

The initial pixel for searching the central point is the pixel at the center of 
the image. For any other point, the positions of neighboring grid points that were 
detected in earlier iterations are used to define the initial pixel. When detecting 
a grid point $p_{i0}$ on the x-axis, the initial pixel depends on the location of $p_{i'0}$, 
which is the point next to $p_{i0}$ and closer to the center. The initial pixel in this 
case is calculated by adding to $p_{i'0}$ a vector $c_{i'}$ with magnitude equal to the 
width of the grid point $p_{i'0}$, directed from the center towards $p_{i'0}$. The initial 
pixel used for detecting points on the y-axis is calculated in a similar way. When 
detecting the point $p_{ij}$ in iteration $k$, the points $p_{i'j'}$, $p_{ij'}$ and $p_{i'j}$ have already 
been detected in iterations $k-1$ and $k-2$. We start the search for $p_{ij}$ from the pixel 
defined by $p_{ij'} + p_{i'j} - p_{i'j'}$ (see Figure 14). 

The detect procedure uses a threshold mechanism to separate the pixels that 
are within the grid points from the background pixels. Since the image contains 
areas with different illumination levels, we use multi-level thresholding to detect 
the points in all areas of the image. 

Fig. 14. a) (left): Finding point $p_{i0}$ based on $p_{i'0}$. b) (right): Finding point $p_{ij}$ 
based on $p_{i'j'}$, $p_{ij'}$ and $p_{i'j}$. The gray rectangle marks the initial pixel. 

p_00 ← detect(0, 0) 
for k = 1 to 2m 
    for each p_ij such that |i| + |j| = k do 
        i' = sign(i) · (|i| − 1) 
        j' = sign(j) · (|j| − 1) 
        if j = 0 then 
            if p_i'0 ≠ fail then 
                p_i0 ← detect(p_i'0 + c_i') 
            else p_i0 ← fail 
        else if i = 0 then 
            if p_0j' ≠ fail then 
                p_0j ← detect(p_0j' + c_j') 
            else p_0j ← fail 
        else if p_ij', p_i'j, p_i'j' ≠ fail then 
            p_ij ← detect(p_ij' + p_i'j − p_i'j') 
        else p_ij ← fail 

Fig. 15. Algorithm for detecting grid points on a spherical image. 
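
The traversal order of Fig. 15 can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: `detect` and `offset` are hypothetical callables supplied by the caller, and `None` plays the role of fail.

```python
def sign(v):
    return (v > 0) - (v < 0)

def detect_grid(m, detect, offset, center=(0.0, 0.0)):
    """Visit grid points in order of increasing |i| + |j|, as in Fig. 15.

    detect(pixel) -> (x, y) centroid, or None in place of `fail`.
    offset(prev, direction) -> step vector added to an axis neighbour.
    Both callables are hypothetical stand-ins for this sketch.
    """
    grid = {(0, 0): detect(center)}
    for k in range(1, 2 * m + 1):
        for i in range(-m, m + 1):
            for j in range(-m, m + 1):
                if abs(i) + abs(j) != k:
                    continue
                ip = sign(i) * (abs(i) - 1)   # i'
                jp = sign(j) * (abs(j) - 1)   # j'
                start = None
                if i == 0 or j == 0:
                    # Axis point: step outward from the inner neighbour.
                    prev = grid[(ip, jp)]
                    if prev is not None:
                        dx, dy = offset(prev, (sign(i), sign(j)))
                        start = (prev[0] + dx, prev[1] + dy)
                else:
                    # Interior point: parallelogram rule from three neighbours.
                    a, b, c = grid[(i, jp)], grid[(ip, j)], grid[(ip, jp)]
                    if a is not None and b is not None and c is not None:
                        start = (a[0] + b[0] - c[0], a[1] + b[1] - c[1])
                grid[(i, j)] = detect(start) if start is not None else None
    return grid
```

A failed detection propagates outward: any point whose required neighbours are missing is itself marked as failed, which matches the else branches of the pseudocode.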

We define an initial threshold level as the minimum gray level such that at 
least 4% of the image pixels are below the threshold. The detect procedure 
finds the pixel closest to the input pixel with a gray level lower than the 
defined threshold. It assumes that this pixel is contained within the grid point. 
If no such pixel is found, the threshold is increased and the search is repeated 
until such a pixel is found or until the threshold reaches the maximum gray value 
(white). In the latter case the procedure returns fail. If a pixel with a low gray 
level is found, all the neighboring pixels with gray levels lower than the 
threshold are grouped to form a grid point. The smallest rectangle that bounds 
the grid point is found. The center of the grid point is the mean of the pixels 
contained in the bounding rectangle, calculated in the following way: let $R$ be 
the bounding rectangle, where $R = \{(x, y) \mid x_1 \le x \le x_2 \text{ and } y_1 \le y \le y_2\}$; then 
the mean over the pixels in $R$ is: 

$$M_x(R) = \frac{\sum_{x=x_1}^{x_2}\sum_{y=y_1}^{y_2} x\,\bigl(C - I(x,y)\bigr)}{\sum_{x=x_1}^{x_2}\sum_{y=y_1}^{y_2} \bigl(C - I(x,y)\bigr)}, \qquad M_y(R) = \frac{\sum_{x=x_1}^{x_2}\sum_{y=y_1}^{y_2} y\,\bigl(C - I(x,y)\bigr)}{\sum_{x=x_1}^{x_2}\sum_{y=y_1}^{y_2} \bigl(C - I(x,y)\bigr)}, \tag{15}$$

where $I(x, y)$ is the gray level of the pixel $(x, y)$ and $C$ is the maximum gray value. 
If the bounding rectangle contains more than just the grid point, which might 
be the case with a high threshold, the procedure returns fail. 
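
A minimal sketch of the detect procedure follows, assuming 8-bit gray levels. The search window `radius` and the threshold increment `step` are assumptions, since the text does not specify them, and the final check that the bounding rectangle contains only the grid point is omitted because its criterion is not stated.

```python
from collections import deque
import numpy as np

def initial_threshold(image, fraction=0.04):
    """Smallest integer level t with at least `fraction` of the pixels below t."""
    levels = np.sort(image.ravel())
    needed = int(np.ceil(fraction * levels.size))
    return int(levels[needed - 1]) + 1

def weighted_centroid(image, x1, x2, y1, y2, c_max=255):
    """Equation (15): coordinates averaged with weights C - I(x, y)."""
    ys, xs = np.mgrid[y1:y2 + 1, x1:x2 + 1]
    weights = c_max - image[y1:y2 + 1, x1:x2 + 1].astype(float)
    total = weights.sum()
    return (xs * weights).sum() / total, (ys * weights).sum() / total

def detect(image, start, threshold, radius=10, step=16, c_max=255):
    """Return the centroid (x, y) of the dark blob nearest `start`, or None.

    `radius` and `step` are assumptions of this sketch.
    """
    h, w = image.shape
    sx, sy = int(round(start[0])), int(round(start[1]))
    ys, xs = np.mgrid[max(sy - radius, 0):min(sy + radius, h - 1) + 1,
                      max(sx - radius, 0):min(sx + radius, w - 1) + 1]
    order = np.argsort(((xs - sx) ** 2 + (ys - sy) ** 2).ravel(), kind="stable")
    candidates = [(int(x), int(y))
                  for x, y in zip(xs.ravel()[order], ys.ravel()[order])]
    t = threshold
    while t <= c_max:
        below = image < t
        seed = next((p for p in candidates if below[p[1], p[0]]), None)
        if seed is not None:
            break
        t += step               # raise the threshold and search again
    else:
        return None             # reached white without finding a dark pixel
    # Group the connected pixels below the threshold (4-neighbourhood).
    blob, queue = {seed}, deque([seed])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < w and 0 <= ny < h and below[ny, nx] and (nx, ny) not in blob:
                blob.add((nx, ny))
                queue.append((nx, ny))
    xs_b = [p[0] for p in blob]
    ys_b = [p[1] for p in blob]
    return weighted_centroid(image, min(xs_b), max(xs_b), min(ys_b), max(ys_b), c_max)
```

Weighting by $C - I(x, y)$ makes dark pixels dominate the mean, so lighter background pixels in the corners of the bounding rectangle pull the centroid only weakly.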
Abstract. Natural or artificial vision systems process the images that 
they collect with their eyes or cameras in order to derive information for 
performing tasks related to navigation and recognition. Since the way im- 
ages are acquired determines how difficult it is to perform a visual task, 
and since systems have to cope with limited resources, the eyes used by 
a specific system should be designed to optimize subsequent image pro- 
cessing as it relates to particular tasks. Different ways of sampling light, 
i.e., different eyes, may be less or more powerful with respect to partic- 
ular competences. This seems intuitively evident in view of the variety 
of eye designs in the biological world. It is shown here that a spherical 
eye (an eye or system of eyes providing panoramic vision) is superior to 
a camera-type eye (an eye with restricted field of view) as regards the 
competence of three-dimensional motion estimation. This result is de- 
rived from a statistical analysis of all the possible computational models 
that can be used for estimating 3D motion from an image sequence. The 
findings explain biological design in a mathematical manner, by showing 
that systems that fly and thus need good estimates of 3D motion gain ad- 
vantages from panoramic vision. Also, insights obtained from this study 
point to new ways of constructing powerful imaging devices that suit 
particular tasks in robotics, visualization and virtual reality better than 
conventional cameras, thus leading to a new camera technology. 



When classifying eye designs in biological systems, one can differentiate between 
the different ways of gathering light at the retina, whether single or multiple 
lenses are used, the spatial distribution of the photoreceptors, the shapes of 
the imaging surfaces, and what geometrical and physical properties of light are 
measured (frequency, polarization). A landscape of eye evolution is provided by 
Michael Land. Considering evolution as a mountain, with the lower hills 
representing earlier steps in the evolutionary ladder, and the highest peaks 
representing later stages of evolution, the situation is pictured in Fig. 1. At the 
higher levels of evolution one finds the compound eyes of insects and crustaceans 
and the camera-type eyes such as the corneal eyes of land vertebrates and fish. 
These two categories constitute two fundamentally different designs. Fundamental 
differences also arise from the positions in the head where camera-type eyes are 
placed, for example, close to each other as in humans and primates, or on 
opposite sides of the head as in birds and fish, providing panoramic vision. It 
appears that the eyes of an organism evolve in a way that best serves that organism 
in carrying out its tasks. Thus, the success of an eye design should not be judged in 
an anthropocentric manner, i.e., by how accurately it forms an image of the outside 
world; rather, it should be judged in a purposive sense. A successful eye design is 
one that makes the performance of the visual tasks a system is confronted with 
as easy as possible (fast and robust). The discovery of principles relating eye 
design to system behavior will shed light on the problem of evolution in general, 
and on the structure and function of the brain in particular. At the same time, it 
will contribute to the development of alternative camera technologies; cameras 
replace eyes in artificial systems and different camera designs will be more or 
less appropriate for different tasks. Cameras used in alarm systems, inspection 
processes, virtual reality systems and human augmentation tasks need not be 
the same; they should be designed to facilitate the tasks at hand. This paper 
represents a first effort to introduce structure into the landscape of eyes as it 
relates to tasks that systems perform. 

Fig. 1. Michael Land's landscape of eye evolution. 

Although the space of tasks or behaviors performed by vision systems is 
difficult to formalize, there exist a few tasks that are performed by the whole 
spectrum of vision systems. All systems with vision move in their environments. 
As they move, they need to continuously make sense of the moving images they 
receive on their retinae and they need to solve problems related to navigation; 
in particular, they need to know how they themselves are moving. 
Inertial sensors can help in this task, but it is vision that can provide accurate 
answers. Regardless of the way in which a system moves (walks, crawls, flies, 
etc.), its eyes move rigidly. This rigid motion can be described by a translation 
and a rotation; knowing how a system moves amounts to knowing the parameters 
describing its instantaneous velocity. This is not to say, of course, that a vision 
system has an explicit representation of the parameters of the rigid motion that 
its eyes undergo. This knowledge could be implicit in the circuits that perform 






C. Fermuller and Y. Aloimonos 



specific tasks, such as stabilization, landing, pursuit, etc., but 
successful completion of navigation-related tasks presupposes some knowledge of 
the egomotion parameters or subsets of them. Thus, a comparison of eyes with 
regard to egomotion estimation should lead to a better understanding of one of 
the most basic visual competences. 

Two fundamentally different eye designs are compared here, a spherical eye 
and a planar, camera-type eye (Fig. 2). Spherical eyes model the compound eyes 
of insects, while planar eyes model the corneal eyes of land vertebrates as well as 
fish. In addition, the panoramic vision of some organisms, achieved by placing 
camera-type eyes on opposite sides of the head, is approximated well by a spher- 
ical eye. The essential difference between a spherical and a planar eye lies in the 
field of view, 360 degrees in the spherical case and a restricted field in the pla- 
nar case. The comparison performed here demonstrates that spherical eyes are 
superior to planar eyes for 3D motion estimation. “Superior” here means that 
the ambiguities inherent in deriving 3D motion from planar image sequences are 
not present in the spherical case. Specifically, a geometrical/statistical analysis is 
conducted to investigate the functions that can be used to estimate 3D motion, 
relating 2D image measurements to the 3D scene. These functions are expressed 
in terms of errors in the 3D motion parameters and they can be understood as 
multi-dimensional surfaces in those parameters. 3D motion estimation amounts 
to a minimization problem; thus, our approach is to study the relationships 
among the parameters of the errors in the estimated 3D motion at the minima 
of the surfaces, because these locations provide insight into the behaviors of the 
estimation procedures. It is shown that, at the locations of the minima, the er- 
rors in the estimates of both the translation and rotation are non-zero in the 
planar case, while in the spherical case either the translational or rotational error 
becomes zero. Intuitively, with a camera-type eye there is an unavoidable con- 
fusion between translation and rotation, as well as between translational errors 
and the actual translation. This confusion does not occur with a spherical eye. 
The implication is that visual navigation tasks involving 3D motion parameter 
estimation are easier to solve with spherical eyes than with planar eyes. 

The basic geometry of image motion is well understood. As a system moves 
in its environment, every point of the environment has a velocity vector relative 
to the system. The projections of these 3D velocity vectors on the retina of the 
system's eye constitute the motion field. For an eye moving with translation 
$\mathbf{t}$ and rotation $\boldsymbol{\omega}$ in a stationary environment, each scene point $\mathbf{R} = (X, Y, Z)$, 
measured with respect to a coordinate system $OXYZ$ fixed to the nodal point 
of the eye, has velocity $\dot{\mathbf{R}} = -\mathbf{t} - \boldsymbol{\omega} \times \mathbf{R}$. Projecting $\dot{\mathbf{R}}$ onto a retina of a given 
shape gives the image motion field. If the image is formed on a plane (Fig. 2b) 
orthogonal to the $Z$ axis at distance $f$ (focal length) from the nodal point, then 
an image point $\mathbf{r} = (x, y, f)$ and its corresponding scene point $\mathbf{R}$ are related by 
$\mathbf{r} = \frac{f}{\mathbf{R} \cdot \mathbf{z}_0}\,\mathbf{R}$, where $\mathbf{z}_0$ is a unit vector in the direction of the $Z$ axis. The motion 
field becomes 

$$\dot{\mathbf{r}} = -\frac{1}{\mathbf{R} \cdot \mathbf{z}_0}\,\bigl(\mathbf{z}_0 \times (\mathbf{t} \times \mathbf{r})\bigr) + \frac{1}{f}\,\mathbf{z}_0 \times \bigl(\mathbf{r} \times (\boldsymbol{\omega} \times \mathbf{r})\bigr) = \frac{1}{Z}\,\mathbf{u}_{\mathrm{tr}}(\mathbf{t}) + \mathbf{u}_{\mathrm{rot}}(\boldsymbol{\omega}), \tag{1}$$

with $Z = \mathbf{R} \cdot \mathbf{z}_0$ representing the depth. If the image is formed on a sphere of 
radius $f$ (Fig. 2a) having the center of projection as its origin, the image $\mathbf{r}$ of 
any point $\mathbf{R}$ is $\mathbf{r} = f\,\mathbf{R}/R$, with $R = |\mathbf{R}|$ being the norm of $\mathbf{R}$ (the range), and the 
image motion is 

$$\dot{\mathbf{r}} = \frac{1}{R}\,\bigl((\mathbf{t} \cdot \mathbf{r})\,\mathbf{r} - \mathbf{t}\bigr) - \boldsymbol{\omega} \times \mathbf{r} = \frac{1}{R}\,\mathbf{u}_{\mathrm{tr}}(\mathbf{t}) + \mathbf{u}_{\mathrm{rot}}(\boldsymbol{\omega}). \tag{2}$$

Fig. 2. Image formation on the sphere (a) and on the plane (b). The system moves 
with a rigid motion with translational velocity $\mathbf{t}$ and rotational velocity $\boldsymbol{\omega}$. Scene 
points $\mathbf{R}$ project onto image points $\mathbf{r}$ and the 3D velocity $\dot{\mathbf{R}}$ of a scene point is 
observed in the image as image velocity $\dot{\mathbf{r}}$. 

The motion field is the sum of two components, one, $\mathbf{u}_{\mathrm{tr}}$, due to translation 
and the other, $\mathbf{u}_{\mathrm{rot}}$, due to rotation. The translational flow is inversely 
proportional to the depth $Z$ or range $R$ of a scene point, while the rotational flow is 
independent of the scene in view. As can be seen from (1) and (2), the effects of 
translation and scene depth cannot be separated, so only the direction of 
translation, $\mathbf{t}/|\mathbf{t}|$, can be computed. We can thus choose the length of $\mathbf{t}$; throughout 
the following analysis $f$ is set to 1, and the length of $\mathbf{t}$ is assumed to be 1 on 
the sphere and the $Z$-component of $\mathbf{t}$ to be 1 on the plane. The problem of 
egomotion then amounts to finding the scaled vector $\mathbf{t}$ and the vector $\boldsymbol{\omega}$ from a 
representation of the motion field. 
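
The two motion-field expressions can be evaluated directly. The sketch below is illustrative only: the function names are ours, the spherical case takes $f = 1$ as in the rest of the analysis, and the sample values at the end are arbitrary.

```python
import numpy as np

Z0 = np.array([0.0, 0.0, 1.0])  # unit vector along the Z axis

def flow_plane(r, t, omega, depth, f=1.0):
    """Equation (1): motion field at image point r = (x, y, f) on the plane."""
    u_tr = -np.cross(Z0, np.cross(t, r))
    u_rot = np.cross(Z0, np.cross(r, np.cross(omega, r))) / f
    return u_tr / depth + u_rot

def flow_sphere(r, t, omega, rng):
    """Equation (2): motion field at unit vector r on the sphere (f = 1).

    `rng` is the range R = |R| of the scene point.
    """
    u_tr = np.dot(t, r) * r - t
    u_rot = -np.cross(omega, r)
    return u_tr / rng + u_rot

# Arbitrary sample values, chosen only for illustration.
P = np.array([0.3, -0.2, 2.0])        # scene point R
t = np.array([0.1, 0.05, 1.0])        # translation
w = np.array([0.02, -0.01, 0.03])     # rotation
planar = flow_plane(P / P[2], t, w, P[2])
spherical = flow_sphere(P / np.linalg.norm(P), t, w, np.linalg.norm(P))
```

Doubling both the translation and the depth (or range) leaves either flow unchanged, which is the scale ambiguity noted in the text.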

To set up mathematical formulations for 3D motion estimation, the following 
questions should be answered. The first question to be addressed is, what 
description containing information about 3D motion does a system use to 
represent the image sequence? One might envision a sophisticated system that could 
attempt to estimate the motion field; such an estimate is termed the optic flow 
field. On the other hand, it is also easy to envision a system that does not have 
the capacity to estimate the motion field, but only to obtain a partial description 
of it. An example of a description containing minimal information about image 
motion is the normal motion field. This amounts to the projection of the motion 
field onto the direction of the image gradient at each point, and represents the 
movement of each local edge element in the direction perpendicular to itself. 
Normal flow can be estimated from local spatiotemporal information in the image 
[22]. If $\mathbf{n}$ is a unit vector at an image point denoting the orientation of the 
gradient at that point, the normal flow satisfies 

$$v_n = \dot{\mathbf{r}} \cdot \mathbf{n}. \tag{3}$$

Unlike normal flow, the estimation of optic flow is a difficult problem because 
information from different image neighborhoods must be compared and used in 
a smoothing scheme to account for discontinuities. Although it is not 
yet known exactly what kinds of image representations different visual systems 
recover, it is clear that such descriptions should lie somewhere between normal 
flow fields and optic flow fields. Thus, when comparing eye designs with regard 
to 3D motion estimation, one must consider both kinds of flow fields. 

The second question to be addressed is, through what geometric laws or 
constraints is 3D motion coded into image motion? The constraints are easily 
observed from (1)-(3). Equations (1) and (2) show how the motions of image 
points are related to 3D rigid motion and to scene depth. By eliminating depth 
from these equations, one obtains the well known epipolar constraint [19]; for 
both planar and spherical eyes it is 

$$(\mathbf{t} \times \mathbf{r}) \cdot (\dot{\mathbf{r}} + \boldsymbol{\omega} \times \mathbf{r}) = 0. \tag{4}$$

Equating image motion with optic flow, this constraint allows for the derivation 
of 3D rigid motion on the basis of optic flow measurements. One is interested in 
the estimates of translation $\hat{\mathbf{t}}$ and rotation $\hat{\boldsymbol{\omega}}$ which best satisfy the epipolar 
constraint at every point $\mathbf{r}$ according to some criterion of deviation. The Euclidean 
norm is usually used, leading to the minimization of the function^1 

$$M_{ep} = \iint_{\mathrm{image}} \bigl[(\hat{\mathbf{t}} \times \mathbf{r}) \cdot (\dot{\mathbf{r}} + \hat{\boldsymbol{\omega}} \times \mathbf{r})\bigr]^2\, d\mathbf{r}. \tag{5}$$

On the other hand, if normal flow is given, the vector equations (1) and (2) 
cannot be used directly. The only constraint is scalar equation (3), along with 
the inequality $Z > 0$, which states that since the surface in view is in front of the 
eye its depth must be positive. Substituting (1) or (2) into (3) and solving for 
the estimated depth $\hat{Z}$ or range $\hat{R}$, we obtain for a given estimate $\hat{\mathbf{t}}, \hat{\boldsymbol{\omega}}$ at each 
point $\mathbf{r}$: 

$$\hat{Z}\ (\text{or } \hat{R}) = \frac{\mathbf{u}_{\mathrm{tr}}(\hat{\mathbf{t}}) \cdot \mathbf{n}}{\bigl(\dot{\mathbf{r}} - \mathbf{u}_{\mathrm{rot}}(\hat{\boldsymbol{\omega}})\bigr) \cdot \mathbf{n}}. \tag{6}$$

^1 Because $\hat{\mathbf{t}} \times \mathbf{r}$ introduces the sine of the angle between $\hat{\mathbf{t}}$ and $\mathbf{r}$, the minimization 
prefers vectors $\hat{\mathbf{t}}$ close to the center of gravity of the points $\mathbf{r}$. This bias has been 
recognized and alternatives have been proposed that reduce this bias, but without 
eliminating the confusion between rotation and translation. 

If the numerator and denominator of (6) have opposite signs, negative depth 
is computed. Thus, to utilize the positivity constraint one must search for the 
motion $\hat{\mathbf{t}}, \hat{\boldsymbol{\omega}}$ that produces a minimum number of negative depth estimates. 
Formally, if $\mathbf{r}$ is an image point, define the indicator function 

$$I_{nd}(\mathbf{r}) = \begin{cases} 1 & \text{for } \bigl(\mathbf{u}_{\mathrm{tr}}(\hat{\mathbf{t}}) \cdot \mathbf{n}\bigr)\bigl((\dot{\mathbf{r}} - \mathbf{u}_{\mathrm{rot}}(\hat{\boldsymbol{\omega}})) \cdot \mathbf{n}\bigr) < 0 \\ 0 & \text{for } \bigl(\mathbf{u}_{\mathrm{tr}}(\hat{\mathbf{t}}) \cdot \mathbf{n}\bigr)\bigl((\dot{\mathbf{r}} - \mathbf{u}_{\mathrm{rot}}(\hat{\boldsymbol{\omega}})) \cdot \mathbf{n}\bigr) > 0. \end{cases}$$

Then estimation of 3D motion from normal flow amounts to minimizing 
the function 

$$M_{nd} = \iint_{\mathrm{image}} I_{nd}(\mathbf{r})\, d\mathbf{r}. \tag{7}$$

Expressing $\dot{\mathbf{r}}$ in terms of the real motion from (1) and (2), functions (5) 
and (7) can be expressed in terms of the actual and estimated motion 
parameters $\mathbf{t}$, $\boldsymbol{\omega}$, $\hat{\mathbf{t}}$ and $\hat{\boldsymbol{\omega}}$ (or, equivalently, the actual motion parameters $\mathbf{t}, \boldsymbol{\omega}$ and the 
errors $\mathbf{t}_\epsilon = \mathbf{t} - \hat{\mathbf{t}}$, $\boldsymbol{\omega}_\epsilon = \boldsymbol{\omega} - \hat{\boldsymbol{\omega}}$) and the depth $Z$ (or range $R$) of the viewed scene. 
To conduct any analysis, a model for the scene is needed. We are interested in 
the statistically expected values of the motion estimates resulting from all 
possible scenes. Thus, as our probabilistic model we assume that the depth values of 
the scene are uniformly distributed between two arbitrary values $Z_{\min}$ (or $R_{\min}$) 
and $Z_{\max}$ (or $R_{\max}$) ($0 < Z_{\min} < Z_{\max}$). For the minimization of negative depth 
values, we further assume that the directions in which flow measurements are 
made are uniformly distributed in every direction for every depth. 
Parameterizing $\mathbf{n}$ by $\psi$, the angle between $\mathbf{n}$ and the $x$ axis, we thus obtain the following 
two functions: 

$$E_{ep} = \int_{Z=Z_{\min}}^{Z_{\max}} M_{ep}\, dZ, \tag{8}$$

$$E_{nd} = \int_{\psi=0}^{\pi} \int_{Z=Z_{\min}}^{Z_{\max}} M_{nd}\, dZ\, d\psi, \tag{9}$$

measuring deviation from the epipolar constraint and the amount of negative 
depth, respectively. Functions (8) and (9) are five-dimensional surfaces in 
the errors in the motion parameters. 
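
For a finite set of measurements, the two criteria can be approximated by sums over sample points. The sketch below does this on the sphere with $f = 1$; the function names, the discrete sums replacing the integrals, and the use of precomputed normal-flow values are assumptions made for illustration.

```python
import numpy as np

def sphere_flow(points, t, omega, ranges):
    """Equation (2) evaluated at unit vectors `points` (N x 3), f = 1."""
    u_tr = (points @ t)[:, None] * points - t
    return u_tr / ranges[:, None] - np.cross(omega, points)

def epipolar_cost(points, flows, t_hat, omega_hat):
    """Discrete analogue of (5): sum of squared epipolar residuals."""
    residual = np.einsum("ij,ij->i",
                         np.cross(t_hat, points),
                         flows + np.cross(omega_hat, points))
    return float(np.sum(residual ** 2))

def negative_depth_count(points, normal_flow, normals, t_hat, omega_hat):
    """Discrete analogue of (7): number of points whose estimate (6) is negative."""
    u_tr = (points @ t_hat)[:, None] * points - t_hat
    u_rot = -np.cross(omega_hat, points)
    numerator = np.einsum("ij,ij->i", u_tr, normals)
    denominator = normal_flow - np.einsum("ij,ij->i", u_rot, normals)
    return int(np.sum(numerator * denominator < 0))
```

At the true motion every residual vanishes and no depth estimate is negative, so both criteria attain their minimum value of zero there; the analysis in the text concerns how the criteria behave away from that point.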

We are interested in the topographic structure of these surfaces, in particular, 
in the relationships among the errors and the relationships of the errors to the 
actual motion parameters at the minima of the functions. The idea behind this is 
that in practical situations any estimation procedure is hampered by errors and 
usually local minima of the functions to be minimized are found as solutions. 

Independent of the particular algorithm, procedures for estimating 3D motion 
can be classified into those estimating either the translation or rotation as a first 
step and the remaining component (that is, the rotation or translation) as a 
second step, and those estimating all components simultaneously. Procedures of 
the former kind result when systems utilize inertial sensors which provide them 
with estimates of one of the components, or when two-step motion estimation 
algorithms are used. 

Thus, three cases need to be studied: the case where no prior information 
about 3D motion is available and the cases where an estimate of translation or 
rotation is available with some error. Imagine that somehow the rotation has 
been estimated, with an error $\boldsymbol{\omega}_\epsilon$. Then our functions become two-dimensional 
in the variables $\mathbf{t}_\epsilon$ and represent the space of translational error parameters 
corresponding to a fixed rotational error. Similarly, given a translational error 
$\mathbf{t}_\epsilon$, the functions become three-dimensional in the variables $\boldsymbol{\omega}_\epsilon$ and represent the 
space of rotational errors corresponding to a fixed translational error. To study 
the general case, one needs to consider the lowest valleys of the functions in 2D 
subspaces which pass through 0. In the image processing literature, such local 
minima are often referred to as ravine lines or courses. Each of the three cases 
is studied for four optimizations: epipolar minimization for the sphere and the 
plane and minimization of negative depth for the sphere and the plane. Thus, 
there are twelve (four times three) cases, but since the effects of rotation on the 
image are independent of depth, it makes no sense to perform minimization of 
negative depth assuming an estimate of translation is available. Thus, we are left 
with ten different cases which are studied below. These ten cases represent all 
the possible, meaningful motion estimation procedures on the plane and sphere. 

Epipolar Minimization on the Plane. Denote estimated quantities by letters with 
hat signs, actual quantities by unmarked letters, and the differences between 
actual and estimated quantities (the errors) by the subscript “$\epsilon$.” Furthermore, 
let $\mathbf{t} = (x_0, y_0, 1)$ and $\boldsymbol{\omega} = (\alpha, \beta, \gamma)$. Since the field of view is small, the quadratic 
terms in the image coordinates are very small relative to the linear and constant 
terms, and are therefore ignored. 

Considering a circular aperture of radius $e$, setting the focal length $f = 1$, 
W = 1 and W = 1, the function in (8) becomes 

^max 6 27T 

Eep= J J J - (3, + + x'^ (y-yo) 

Z=Zrain r=0 (p=0 

(x — xq)^ ^drd(j>dZ 

where $(r, \phi)$ are polar coordinates ($x = r\cos\phi$, $y = r\sin\phi$). Performing the 
integration, one obtains 

Eep = Tre^ ^(Zmax “ ^min) + \ (je (^0 + ^o) + {Xode + IJofde) + 

^ One may wish to study the problem in the presence of noise in the flow measurements 
and derive instead the expected values of the local and global minima. It has been 
shown, however, that noise which is of no particular bias does not alter the local 
minima, and the global minima fall within the valleys of the function without noise. 
In particular, the noise considered in [3] is built from two 2D, independent, 
stochastic error vectors $\epsilon$ and $\delta$. As such noise does not alter the function's 
overall structure, it will not be considered here; the interested reader is referred to [3]. 



2 / - 2/0 



+ ae-JeX + y 
+ + (In (^max) ~ In (^min)) Q(37£(2;0e2/0 

- yo,xo) + xo,f3e - yo,OL^)e^ + 2 {xo,yo ~ yo,xo) {xoa^ + yo^e) ^ 

+ ~ z~) + + {xo,yo-yo,xof'^'^ ( 10 ) 

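(a) Assume that the translation has been estimated with a certain error 
$\mathbf{t}_\epsilon = (x_{0\epsilon}, y_{0\epsilon}, 0)$. Then the relationship among the errors in 3D motion at the 
minima of (10) is obtained from the first-order conditions 
$\partial E_{ep}/\partial \alpha_\epsilon = \partial E_{ep}/\partial \beta_\epsilon = \partial E_{ep}/\partial \gamma_\epsilon = 0$, which yield 

$$\alpha_\epsilon = \frac{y_{0\epsilon}\,(\ln Z_{\max} - \ln Z_{\min})}{Z_{\max} - Z_{\min}}; \qquad \beta_\epsilon = \frac{-x_{0\epsilon}\,(\ln Z_{\max} - \ln Z_{\min})}{Z_{\max} - Z_{\min}}; \qquad \gamma_\epsilon = 0. \tag{11}$$

It follows that $\alpha_\epsilon/\beta_\epsilon = -y_{0\epsilon}/x_{0\epsilon}$ and $\gamma_\epsilon = 0$, which means that there is no error in 
$\gamma$ and the projection of the translational error on the image is perpendicular to 
the projection of the rotational error. This constraint is called the “orthogonality 
constraint.” 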


(b) Assuming that rotation has been estimated with an error $(\alpha_\epsilon, \beta_\epsilon, \gamma_\epsilon)$, the 
relationship among the errors is obtained from $\partial E_{ep}/\partial x_{0\epsilon} = \partial E_{ep}/\partial y_{0\epsilon} = 0$. In this case, 
the relationship is very elaborate and the translational error depends on all the 
other parameters, that is, the rotational error, the actual translation, the image 
size and the depth interval. 



(c) In the general case, we need to study the subspaces in which $E_{ep}$ changes 
least at its absolute minimum; that is, we are interested in the direction of the 
smallest second derivative at 0, the point where the motion errors are zero. To 
find this direction, we compute the Hessian at 0, that is, the matrix of the second 
derivatives of $E_{ep}$ with respect to the five motion error parameters, and compute 
the eigenvector corresponding to the smallest eigenvalue. The scaled components 
of this vector amount to 


3 : 0 , =^0 yo, = yo Pe = -Oie^ 7e = 0 
— 2yp,Z^n7in.Z^niax (1^ (-^max) l^(-^min)) / 



^ (■^max ■^min) (^max'^min 1) T ^ (-^max -^min) (•^max-^min 1) 



(l'^ (-^max) ln(.^min)) 



As can be seen, for points defined by this direction, the translational and 
rotational errors are characterized by the orthogonality constraint 
$\alpha_\epsilon/\beta_\epsilon = -y_{0\epsilon}/x_{0\epsilon}$ and by the constraint $x_0/y_0 = \hat{x}_0/\hat{y}_0$; that is, the projection of 
the actual translation and the projection of the estimated translation lie on a line 
passing through the image center. We refer to this second constraint as the 
“line constraint.” These results are in accordance with previous studies, which 
found that the translational components along the x and y axes are confused with 
rotation around the y and x axes, respectively, and which found the “line 
constraint” under a set of restrictive assumptions. 

Epipolar Minimization on the Sphere. The function representing deviation from 
the epipolar constraint on the sphere takes the simple form 

^max 2 

Eep= j y y I ^ ■ (* >< oj dAtm 

Rmin sphere 

where A refers to a surface element. Due to the sphere’s symmetry, for each point 
r on the sphere, there exists a point with coordinates — r. Since Utr(r) = Utr(— r) 
and Urot(r) = — Ui-ot(— r)> when the integrand is expanded the product terms 
integrated over the sphere vanish. Thus 

Eep= J yy + ((we X r) • (t X r))^| dAdi? 

(a) Assuming that translation $\hat{t}$ has been estimated, the $\omega_\epsilon$ that
minimizes $E_{ep}$ is $\omega_\epsilon = 0$, since the resulting function is
non-negative quadratic in $\omega_\epsilon$ (minimum at zero). The difference between
sphere and plane is already clear. In the spherical case, as shown here, if an
error in the translation is made we do not need to compensate for it by making
an error in the rotation ($\omega_\epsilon = 0$), while in the planar case we need to
compensate to ensure that the orthogonality constraint is satisfied!
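The claim in (a) can be checked numerically. The following is a minimal sketch, not the authors' procedure: it assumes the translational flow direction $u_{tr}(t) = (t \cdot r)r - t$ and a rotational residual $-\omega_\epsilon \times r$, and it replaces the integrals by sums over a hypothetical antipodally symmetric point set (`antipodal_samples`) and a short list of ranges. All function names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def u_tr(t, r):
    # Translational flow direction on the unit sphere: (t . r) r - t.
    k = dot(t, r)
    return tuple(k * ri - ti for ri, ti in zip(r, t))

def antipodal_samples(n):
    # n spiral points on the upper hemisphere plus their exact antipodes,
    # so the sample set has the symmetry used in the text.
    pts = []
    golden = math.pi * (3.0 - math.sqrt(5.0))
    for i in range(n):
        z = (i + 0.5) / n
        rho = math.sqrt(1.0 - z * z)
        p = (rho * math.cos(i * golden), rho * math.sin(i * golden), z)
        pts.append(p)
        pts.append((-p[0], -p[1], -p[2]))
    return pts

def epipolar_error(t, t_hat, w_eps, ranges, pts):
    # Discrete stand-in for E_ep: squared deviation from the epipolar
    # constraint, summed over sample points and range values.
    total = 0.0
    for R in ranges:
        for r in pts:
            resid = tuple(u / R - w
                          for u, w in zip(u_tr(t, r), cross(w_eps, r)))
            total += dot(resid, cross(t_hat, r)) ** 2
    return total
```

With such a symmetric sample set the cross terms cancel in pairs, so the error for any nonzero `w_eps` exceeds the error for `w_eps = (0, 0, 0)` by exactly the purely rotational quadratic term.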



(b) Assuming that rotation has been estimated with an error $\omega_\epsilon$, what is
the translation $\hat{t}$ that minimizes $E_{ep}$? Since R is uniformly distributed,
integrating over R does not alter the form of the error in the optimization.
Thus, $E_{ep}$ consists of the sum of two terms:

$$K = K_1 \iint_{\text{sphere}} ((t \times \hat{t}) \cdot r)^2\, dA \quad\text{and}\quad L = L_1 \iint_{\text{sphere}} ((\omega_\epsilon \times r) \cdot (\hat{t} \times r))^2\, dA,$$

where $K_1, L_1$ are multiplicative factors depending only on $R_{\min}$ and
$R_{\max}$. For angles between $t, \hat{t}$ and $\hat{t}, \omega_\epsilon$ in the range of 0
to $\pi/2$, K and L are monotonic functions. K attains its minimum when
$\hat{t} = t$ and L when $\hat{t} \perp \omega_\epsilon$. Consider a certain distance between
$t$ and $\hat{t}$ leading to a certain value K, and change the position of $\hat{t}$.
L takes its minimum when $(t \times \hat{t}) \cdot \omega_\epsilon = 0$, as follows from the
cosine theorem. Thus $E_{ep}$ achieves its minimum when $\hat{t}$ lies on the great
circle passing through $t$ and $\omega_\epsilon$, with the exact position depending on
$|\omega_\epsilon|$ and the scene in view.

(c) For the general case where no information about rotation or translation is
available, we study the subspaces where $E_{ep}$ changes the least at its absolute
minimum, i.e., we are again interested in the direction of the smallest second
derivative at 0. For points defined by this direction we calculate $\hat{t} = t$
and $\omega_\epsilon \perp t$.
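As a check on the monotonicity statement in case (b), the two surface integrals can be evaluated in closed form. The sketch below is not part of the original derivation; it assumes unit vectors $t$ and $\hat{t}$ and uses the standard second and fourth moments of the unit sphere, stated in the comments.

```latex
% Assumptions: |t| = |\hat t| = |r| = 1, and the unit-sphere moments
%   \iint r_i r_j \, dA = \tfrac{4\pi}{3}\,\delta_{ij},
%   \iint r_i r_j r_k r_l \, dA
%     = \tfrac{4\pi}{15}\,(\delta_{ij}\delta_{kl} + \delta_{ik}\delta_{jl} + \delta_{il}\delta_{jk}).
\begin{align*}
\frac{K}{K_1} &= \iint ((t \times \hat t) \cdot r)^2 \, dA
   = \frac{4\pi}{3}\,|t \times \hat t|^2
   = \frac{4\pi}{3}\,\sin^2\angle(t, \hat t), \\
(\omega_\epsilon \times r) \cdot (\hat t \times r)
  &= \omega_\epsilon \cdot \hat t - (\omega_\epsilon \cdot r)(\hat t \cdot r), \\
\frac{L}{L_1} &= 4\pi(\omega_\epsilon \cdot \hat t)^2
   - \frac{8\pi}{3}(\omega_\epsilon \cdot \hat t)^2
   + \frac{4\pi}{15}\left(|\omega_\epsilon|^2 + 2(\omega_\epsilon \cdot \hat t)^2\right) \\
  &= \frac{4\pi}{15}\,|\omega_\epsilon|^2
     \left(1 + 7\cos^2\angle(\hat t, \omega_\epsilon)\right).
\end{align*}
```

Under these assumptions K grows with the angle between $t$ and $\hat{t}$ on $[0, \pi/2]$ and vanishes at $\hat{t} = t$, while L shrinks as the angle between $\hat{t}$ and $\omega_\epsilon$ grows and is smallest at $\hat{t} \perp \omega_\epsilon$, in agreement with the text.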



To study the negative depth values, a more geometric interpretation is needed.
Substituting the value of $\dot{r}$ from the motion field equations gives

$$\hat{Z}\ (\text{or } \hat{R}) = Z\ (\text{or } R) \cdot \frac{u_{tr}(\hat{t}) \cdot n}{\left(u_{tr}(t) - Z\ (\text{or } R)\, u_{rot}(\omega_\epsilon)\right) \cdot n}$$

This equation shows that for every n and r a range of values for Z (or R) is
obtained which result in negative estimates $\hat{Z}$ (or $\hat{R}$). Thus for each
direction n, considering all image points r, we obtain a volume in space
corresponding to negative depth estimates. The sum of all these volumes for all
directions is termed the “negative depth” volume, and calculating 3D motion in
this case amounts to minimizing this volume. Minimization of this volume
provides conditions for the errors in the motion parameters.



Minimizing Negative Depth Volume on the Plane. This analysis is given elsewhere;
the findings are summarized here:

(a) Assume that rotation has been estimated with an error
$(\alpha_\epsilon, \beta_\epsilon, \gamma_\epsilon)$. Then the error $(x_{0_\epsilon}, y_{0_\epsilon})$ that
minimizes the negative depth volume satisfies the orthogonality constraint
$x_{0_\epsilon}/y_{0_\epsilon} = -\beta_\epsilon/\alpha_\epsilon$.

(b) In the absence of any prior information about the 3D motion, the solution
obtained by minimizing the negative depth volume has errors that satisfy
the orthogonality constraint $x_{0_\epsilon}/y_{0_\epsilon} = -\beta_\epsilon/\alpha_\epsilon$, the line
constraint $x_0/y_0 = \hat{x}_0/\hat{y}_0$, and $\gamma_\epsilon = 0$.



Minimizing Negative Depth Volume on the Sphere

(a) Assuming that the rotation has been estimated with an error $\omega_\epsilon$, what
is the optimal translation $\hat{t}$ that minimizes the negative depth volume?

Since the motion field along different orientations n is considered, a
parameterization is needed to express all possible orientations on the sphere.
This is achieved by selecting an arbitrary vector s; then, at each point r of
the sphere, $\frac{s \times r}{|s \times r|}$ defines a direction in the tangent plane.
As s moves along half a circle, $\frac{s \times r}{|s \times r|}$ takes on every possible
orientation (with the exception of the points r lying on the great circle of
s). Let us pick $\omega_\epsilon$ perpendicular to s ($s \cdot \omega_\epsilon = 0$).

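We are interested in the points in space with estimated negative range
values $\hat{R}$. Since $n = \frac{s \times r}{|s \times r|}$ and $s \cdot \omega_\epsilon = 0$, the
estimated range $\hat{R}$ amounts to

$$\hat{R} = R\, \frac{(\hat{t} \times s) \cdot r}{(t \times s) \cdot r - R(\omega_\epsilon \cdot r)(s \cdot r)}, \qquad \hat{R} < 0 \ \text{ if } \ \operatorname{sgn}[(\hat{t} \times s) \cdot r] = -\operatorname{sgn}[(t \times s) \cdot r - R(\omega_\epsilon \cdot r)(s \cdot r)],$$

where sgn(x) provides the sign of x. This constraint divides the surface of the
sphere into four areas, I to IV, whose locations are defined by the signs of the
functions $(t \times s) \cdot r$, $(\hat{t} \times s) \cdot r$ and $(\omega_\epsilon \cdot r)(s \cdot r)$,
as shown in Fig. 3.

Area   Location and constraint on R

I      $\operatorname{sgn}((t \times s) \cdot r) = \operatorname{sgn}((\hat{t} \times s) \cdot r) = \operatorname{sgn}((r \cdot \omega_\epsilon)(r \cdot s))$;
       constraint on R: $|R| > \frac{(t \times s) \cdot r}{(r \cdot \omega_\epsilon)(r \cdot s)}$

II     $-\operatorname{sgn}((t \times s) \cdot r) = \operatorname{sgn}((\hat{t} \times s) \cdot r) = \operatorname{sgn}((r \cdot \omega_\epsilon)(r \cdot s))$;
       constraint on R: all $|R|$

III    $\operatorname{sgn}((t \times s) \cdot r) = -\operatorname{sgn}((\hat{t} \times s) \cdot r) = \operatorname{sgn}((r \cdot \omega_\epsilon)(r \cdot s))$;
       constraint on R: $|R| < \frac{(t \times s) \cdot r}{(r \cdot \omega_\epsilon)(r \cdot s)}$

IV     $\operatorname{sgn}((t \times s) \cdot r) = \operatorname{sgn}((\hat{t} \times s) \cdot r) = -\operatorname{sgn}((r \cdot \omega_\epsilon)(r \cdot s))$;
       constraint on R: none

Fig. 3. Classification of image points according to constraints on R. The four areas are
marked by different colors. The textured parts (parallel lines) in areas I and III denote
the image points for which negative depth values exist if the scene is bounded. The two
hemispheres correspond to the front of the sphere and the back of the sphere, both as
seen from the front of the sphere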
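The sign conditions that define areas I to IV can be exercised directly. The sketch below is illustrative only: it assumes unit vectors, $s \cdot \omega_\epsilon = 0$, and the estimated-range expression $\hat{R} = R\,(\hat{t} \times s) \cdot r \,/\, ((t \times s) \cdot r - R(\omega_\epsilon \cdot r)(s \cdot r))$; the function names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)

def sign(x):
    return (x > 0) - (x < 0)

def terms(t, t_hat, s, w_eps, r):
    # The three sign-defining functions at image point r.
    a = dot(cross(t, s), r)          # (t x s) . r
    b = dot(cross(t_hat, s), r)      # (t_hat x s) . r
    c = dot(w_eps, r) * dot(s, r)    # (w_eps . r)(s . r)
    return a, b, c

def classify_area(t, t_hat, s, w_eps, r):
    # Assumes none of the three functions is exactly zero at r.
    sa, sb, sc = (sign(x) for x in terms(t, t_hat, s, w_eps, r))
    if sa == sb == sc:
        return "I"
    if -sa == sb == sc:
        return "II"
    if sa == -sb == sc:
        return "III"
    return "IV"

def estimated_range(t, t_hat, s, w_eps, r, R):
    a, b, c = terms(t, t_hat, s, w_eps, r)
    return R * b / (a - R * c)
```

For a point in area I the estimate is negative exactly when R exceeds $(t \times s) \cdot r / ((\omega_\epsilon \cdot r)(s \cdot r))$, in area III exactly when R is below that value, in area II for every R, and in area IV for none.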

For any direction n a volume of negative range values is obtained consisting
of the volumes above areas I, II and III. Areas II and III cover the same amount
of area between the great circles $(t \times s) \cdot r = 0$ and $(\hat{t} \times s) \cdot r = 0$,
and area I covers a hemisphere minus the area between $(t \times s) \cdot r = 0$ and
$(\hat{t} \times s) \cdot r = 0$. If the scene in view is unbounded, that is,
$R \in [0, +\infty]$, there is for every r a range of values above areas I and III
which result in negative depth estimates; in area I the volume at each point r
is bounded from below by $R = \frac{(t \times s) \cdot r}{(\omega_\epsilon \cdot r)(s \cdot r)}$, and
in area III it is bounded from above by
$R = \frac{(t \times s) \cdot r}{(\omega_\epsilon \cdot r)(s \cdot r)}$. If there exist lower and
upper bounds $R_{\min}$ and $R_{\max}$ in the scene, we obtain two additional curves
$C_{\min}$ and $C_{\max}$ with
$C_{\min} = (t \times s) \cdot r - R_{\min}(\omega_\epsilon \cdot r)(s \cdot r) = 0$ and
$C_{\max} = (t \times s) \cdot r - R_{\max}(\omega_\epsilon \cdot r)(s \cdot r) = 0$,
and we obtain negative depth values in area I only between $C_{\max}$ and
$(t \times s) \cdot r = 0$ and in area III only between $C_{\min}$ and
$(\omega_\epsilon \cdot r)(s \cdot r) = 0$. We are given $\omega_\epsilon$ and $t$, and we are
interested in the $\hat{t}$ which minimizes the negative range volume.
For any s the corresponding negative range volume becomes smallest if $\hat{t}$ is
on the great circle through $t$ and s, that is, $(t \times s) \cdot \hat{t} = 0$, as will
be shown next.

Fig. 4. Configuration for $t$ and $\hat{t}$ on the great circle of s and $\omega_\epsilon$ perpendicular to s.
The textured part of area I denotes image points for which negative depth values exist
if the scene is bounded



Let us consider a $\hat{t}$ such that $(t \times s) \cdot \hat{t} \neq 0$ and let us change
$\hat{t}$ so that $(t \times s) \cdot \hat{t} = 0$. As $\hat{t}$ changes, the area of type II
becomes an area of type IV and the area of type III becomes an area of type I.
The negative depth volume is changed as follows: It is decreased by the spaces
above area II and area III, and it is increased by the space above area I
(which changed from type III to type I). Clearly, the decrease is larger than
the increase, which implies that the smallest volume is obtained for
$s, t, \hat{t}$ lying on a great circle. Since this is true for any s, the minimum
negative depth volume is attained for $\hat{t} = t$.¹

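(b) Next, assume that no prior knowledge about the 3D motion is available. We
want to know for which configurations of $\hat{t}$ and $\omega_\epsilon$ the negative depth
values change the least in the neighborhood of the absolute minimum, that is,
at $t_\epsilon = \omega_\epsilon = 0$. From the analysis above, it is known that for any
$\omega_\epsilon \neq 0$, $\hat{t} = t$. Next, we show that $\omega_\epsilon$ is indeed different from
zero: Take $\hat{t} \neq t$ on the great circle of s and let $\omega_\epsilon$, as before, be
perpendicular to s.

Since $(t \times s) \times \omega_\epsilon = 0$, the curves $C_{\max}$ and $C_{\min}$ can be expressed as

$$C_{\max(\min)} = \left(\sin\angle(t, s) - R_{\max(\min)}\,|\omega_\epsilon|\,(s \cdot r)\right)\left(\frac{\omega_\epsilon}{|\omega_\epsilon|} \cdot r\right) = 0,$$

where $\angle(t, s)$ denotes the angle between vectors $t$ and s. These curves
consist of the great circle $\omega_\epsilon \cdot r = 0$ and the circle
$\frac{\sin\angle(t, s)}{R_{\max(\min)}|\omega_\epsilon|} - (s \cdot r) = 0$ parallel to the great
circle $(s \cdot r) = 0$ (see Fig. 4). If
$\frac{\sin\angle(t, s)}{R_{\max(\min)}|\omega_\epsilon|} > 1$, this circle disappears.

Consider next two flow directions defined by vectors $s_1$ and $s_2$ with
$(s_1 \times t) = -(s_2 \times t)$ and $s_1$ between $t$ and $\hat{t}$.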

¹ A word of caution about the parameterization used for directions
$n = \frac{s \times r}{\|s \times r\|}$ is needed. It does not treat all orientations
equally (as s varies along a great circle with constant speed, s × r accelerates
and decelerates). Thus to obtain a uniform distribution, normalization is
necessary. The normalization factors, however, do not affect the previous
proof, due to symmetry.

For every point $r_1$ in area III defined by $s_1$ there exists a point $r_2$ in
area I defined by $s_2$ such that the negative estimated ranges above $r_1$ and
$r_2$ add up to $R_{\max} - R_{\min}$. Thus the volume of negative range obtained from
$s_1$ and $s_2$ amounts to the area of the sphere times $(R_{\max} - R_{\min})$ (area II
of $s_1$ contributes a hemisphere; area III of $s_1$ and area I of $s_2$ together
contribute a hemisphere). The total negative range volume can be decomposed
into three components: a component $V_1$ originating from the set of s between
$t$ and $\hat{t}$, a component $V_2$ originating from the set of s symmetric in $t$ to
the set in $V_1$, and a component $V_3$ corresponding to the remaining s, which
consists of range values above areas of type I only. If for all s in $V_3$,
$\frac{\sin\angle(t, s)}{R_{\max}|\omega_\epsilon|} > 1$, $V_3$ becomes zero. Thus for all
$|\omega_\epsilon|$ satisfying this condition, the negative range volume is equally large
and amounts to the area on the sphere times $(R_{\max} - R_{\min})$ times
$\angle(t, \hat{t})$. Unless $R_{\max} = \infty$, $|\omega_\epsilon|$ takes on values different from
zero.

This shows that for any $t_\epsilon \neq 0$, there exist vectors $\omega_\epsilon \neq 0$ which give
rise to the same negative depth volume as $\omega_\epsilon = 0$. However, for any such
$\omega_\epsilon \neq 0$ this volume is larger than the volume obtained by setting
$t_\epsilon = 0$. It follows that $\hat{t} = t$. From Fig. 4 it can furthermore be deduced
that for a given $\omega_\epsilon$ the negative depth volume, which for $\hat{t} = t$ only lies
above areas of type I, decreases as $t$ moves along a great circle away from
$\omega_\epsilon$, as the areas between $C_{\min}$ and $C_{\max}$ and between $C_{\min}$ and
$(t \times s) \cdot r = 0$ decrease. This proves that in addition to $\hat{t} = t$,
$t \perp \omega_\epsilon$.

The preceding results demonstrate the advantages of spherical eyes for the
process of 3D motion estimation. Table 1 lists the eight out of ten cases which
lead to clearly defined error configurations. It shows that 3D motion can be
estimated more accurately with spherical eyes. Depending on the estimation
procedure used (and systems might use different procedures for different
tasks), either the translation or the rotation can be estimated very
accurately. For planar eyes, this is not the case, as for all possible
procedures there exists confusion between the translation and rotation. The
error configurations also allow systems with inertial sensors to use more
efficient estimation procedures. If a system utilizes a gyrosensor which
provides an approximate estimate of its rotation, it can employ a simple
algorithm based on the negative depth constraint for only translational motion
fields to derive its translation and obtain a very accurate estimate. Such
algorithms are much easier to implement than algorithms designed for completely
unknown rigid motions, as they amount to searches in 2D as opposed to 5D
spaces. Similarly, there exist computational advantages for systems with
translational inertial sensors in estimating the remaining unknown rotation.

In nature, systems that walk and perform sophisticated manipulation have
camera-type eyes, and systems that fly usually have panoramic vision, either
through compound eyes or a combination of camera-type eyes. The obvious
explanation for this difference is the need for a larger field of view in
flying species, and the need for very accurate segmentation and shape
estimation, and thus high resolution in a limited field of view, for
land-walking species. As shown in this paper, the geometry of the sphere also
provides a computational advantage; it allows for more efficient and accurate
egomotion estimation (even at the expense of trading off resolution in some
systems, for example, in insects), and this is much more necessary for systems
flying and thus moving with all six degrees of freedom than for systems moving
with usually limited rigid motion on surfaces.

Table 1. Summary of results

I. Epipolar minimization, given optic flow

   Spherical eye:
   (a) Given a translational error $t_\epsilon$, the rotational error $\omega_\epsilon = 0$.
   (b) Without any prior information, $t_\epsilon = 0$ and $\omega_\epsilon \perp t$.

   Camera-type eye:
   (a) For a fixed translational error $(x_{0_\epsilon}, y_{0_\epsilon})$, the rotational
       error $(\alpha_\epsilon, \beta_\epsilon, \gamma_\epsilon)$ is of the form $\gamma_\epsilon = 0$,
       $\alpha_\epsilon/\beta_\epsilon = -x_{0_\epsilon}/y_{0_\epsilon}$.
   (b) Without any a priori information about the motion, the errors satisfy
       $\gamma_\epsilon = 0$, $\alpha_\epsilon/\beta_\epsilon = -x_{0_\epsilon}/y_{0_\epsilon}$,
       $x_0/y_0 = \hat{x}_0/\hat{y}_0$.

II. Minimization of negative depth volume, given normal flow

   Spherical eye:
   (a) Given a rotational error $\omega_\epsilon$, the translational error $t_\epsilon = 0$.
   (b) Without any prior information, $t_\epsilon = 0$ and $\omega_\epsilon \perp t$.

   Camera-type eye:
   (a) Given a rotational error, the translational error is of the form
       $-x_{0_\epsilon}/y_{0_\epsilon} = \alpha_\epsilon/\beta_\epsilon$.
   (b) Without any error information, the errors satisfy $\gamma_\epsilon = 0$,
       $\alpha_\epsilon/\beta_\epsilon = -x_{0_\epsilon}/y_{0_\epsilon}$, $x_0/y_0 = \hat{x}_0/\hat{y}_0$.

The above results also point to ways of constructing new, powerful eyes by
taking advantage of both the panoramic vision of flying systems and the
high-resolution vision of primates. An eye like the one in Fig. 5, assembled
from a few video cameras arranged on the surface of a sphere,² can easily
estimate 3D motion since, while it is moving, it is sampling a spherical motion
field! Even more important for today’s applications is the reconstruction of
the shape of an object or scene in a very accurate manner. Accurate shape
models are needed in many applications dealing with visualization, as in video
editing/manipulation or in virtual reality settings. To obtain accurate shape
reconstruction, both the 3D transformation relating two views and the 2D
transformation relating two images are needed with good precision. Given
accurate 3D motion $(t, \omega)$ and image motion $(\dot{r})$, the motion field
equations can be used in a straightforward manner to estimate depth (Z) or
range (R) and thus object shape. An eye like the one in Fig. 5 not only has
panoramic properties, eliminating the rotation/translation confusion, but it
has the unexpected benefit of making it easy to estimate image motion with high
accuracy. Any two cameras with overlapping fields of view also provide
high-resolution stereo vision, and this collection of stereo systems makes it
possible to locate a large number of depth discontinuities. Given scene
discontinuities, image motion can be estimated very accurately. As a
consequence, the eye in Fig. 5 is very well suited to developing accurate
models of the world, and many experiments have confirmed this finding. However,
such an eye, although appropriate for a moving robotic system, may be
impractical to use in a laboratory. Fortuitously, from a mathematical
viewpoint, it makes no difference whether the cameras are looking inward or
outward!

² Like a compound eye with video cameras replacing ommatidia.

Fig. 5. A compound-like eye composed of conventional video cameras, arranged on a
sphere and looking outward

Consider, then, a “negative” spherical eye like the one in Fig. 6, where video
cameras are arranged on the surface of a sphere pointing toward its center.
Imaging a moving rigid object at the center of the sphere creates image motion
fields at the center of each camera which are the same as the ones that would
be created if the whole spherical dome were moving with the opposite rigid
motion! Thus, utilizing information from all the cameras, the 3D motion of the
object inside the sphere can be accurately estimated, and at the same time
accurate shape models can be obtained from the motion field of each camera. The
negative spherical eye also allows for accurate recovery of models of action,
such as human movement, because putting together motion and shape, sequences of
3D motion fields representing the motion inside the dome can be estimated. Such
action models will find many applications in telereality, graphics and
recognition. The above described configurations are examples of alternative
sensors, and they also demonstrate that multiple-view vision has great
potential. Different arrangements best suited for other problems can be
imagined. This was perhaps foreseen in ancient Greek mythology, which has
Argus, the hundred-eyed guardian of Hera, the goddess of Olympus, defeating a
whole army of Cyclopes, one-eyed giants!

Fig. 6. A “negative” spherical eye, consisting of conventional video cameras arranged
on a sphere and pointing inward
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Abstract. This paper proposes a new and general model of panoramic images, 
namely polycentric panoramas, which formalizes the essential characteristics of 
panoramic image acquisition geometry. This new model is able to describe a 
wide range of panoramic images including those which have been previously 
introduced such as single-center, multi-perspective, or concentric panoramas,
and that are potentially of interest in further research. This paper presents
a study of epipolar geometry for pairs of polycentric panoramas. The first and 
unique epipolar curve equation derived provides a unified approach for computing 
epipolar curves in more specific types of panoramic images. Examples of epipolar 
curves in different types of panoramic images are also discussed in the paper. 



1 Introduction 



A panoramic image can be acquired by rotating a (slit) camera with respect to a fixed
rotation axis and taking images consecutively at equidistant angles. This paper only
discusses panoramic images in cylindrical representation. The panoramic image
acquisition model has been formally discussed in earlier work. There are three
essential parameters in this image acquisition model: f, r, and ω, where f is the
camera's effective focal length, r is the distance between the camera’s focal point
and the rotation axis, and ω specifies the orientation of the camera (see more details
in the next section). Polycentric panoramas are a collection of panoramic images
acquired with respect to different rotation axes, where the associated parameters for
each image may differ. Note that r can be either greater than or equal to zero.

Multi-perspective panoramic images have recently received increasing attention for
applications in 3D scene visualization and reconstruction. Polycentric panoramas are
able to describe a wide range of multi-perspective panoramic images such as concentric
and single-center panoramas. A collection of (multi-perspective) panoramic images all
acquired with respect to the same rotation axis is referred to as a set of concentric
panoramic images. H.-Y. Shum and R. Szeliski have shown that the epipolar geometry
consists of horizontal lines if two concentric panoramic images are symmetric with
respect to the camera viewing direction, that is, when the associated angular
parameters of these two panoramic images are ω and -ω, respectively. A panoramic image
acquired with a single focal point, i.e. r = 0, is referred to as a single-center
panoramic image. Studies of epipolar curves in a pair of single-center panoramic
images, as well as examples of stereo reconstruction and 3D scene visualization based
on a given set of single-center panoramic images, can be found in the literature.

Geometric studies, such as epipolar geometry or 3D reconstruction, are well
established for pairs of planar images. Compared to that, the computer vision
literature still lacks work on pairs of panoramic images. Due to differences in image
geometry between planar and panoramic image models, geometric properties of planar
images do not necessarily hold for panoramic images. In this paper, we focus on the
derivation of an epipolar curve equation for a pair of polycentric panoramic images. The
epipolar curve equation derived provides a unified approach for the epipolar geometry
study in any of the more specific classes of panoramic images mentioned above.

The paper is organized as follows. The acquisition model of polycentric panoramic 
images is given in Section 2. The derivation of the epipolar curve equation through 
various geometric transformations is elaborated in Section 3. In Section 4, the epipolar 
curves in some special cases of panoramic images are presented mathematically and 
graphically. Concluding remarks and comments on future work are in Section 5. 



2 Image Acquisition Model 

Polycentric panoramic images can be acquired by different imaging methods. One
possible way is to use a slit camera. A slit camera is characterized geometrically
by a single focal point and a 1D linear image slit.¹ Ideally, the focal point lies on
the bisector of an image slit. The focal point of a slit camera is denoted as C, the
effective focal length is denoted as f, and the slit image captured is denoted as
$\mathcal{I}$.

To acquire a panoramic image, a slit camera rotates with respect to a fixed 3D axis
(e.g. the rotation axis of a turntable) and captures one slit image for every subsequent
angular interval of constant size. We assume that the distance between the slit camera’s
focal point and the rotation axis, r, remains constant during such a panoramic image
acquisition process. We further assume that these focal points are in a single plane
(exactly) orthogonal to the rotation axis, i.e. they lie on the circle $\mathcal{C}$ of
radius r. We call this circle the base circle. It follows that the optical axis of a
slit camera is always in the plane of this base circle. We call this plane the base
plane and denote it as S.

Each slit image contributes to one column of a panoramic image. A panoramic
image, denoted as $\mathcal{P}$, can be considered as being a planar h × i rectangular
array, where h specifies the resolution of the slit camera and i specifies the number
of slit images acquired for one panoramic image.

Besides the parameters f and r, our acquisition model allows a slit camera to have one
more degree of freedom: a horizontal rotation parallel to the plane of the base circle. It
is specified by an angle, ω, between the normal vector of the base circle at the associated
focal point and the optical axis of the slit camera. Altogether, these three parameters,
f, r, and ω, are the essential parameters characterizing a single panoramic image. They
remain constant throughout one acquisition process for such a panoramic image.

¹ An image slit is defined by a line segment and the receptors (i.e. photon-sensing elements)
positioned on this line segment.

Fig. 1. Slit-image and slit-camera coordinate systems. An image point p on the image slit can be
represented either by an angular coordinate φ or the coordinates (0, y, f), where f is the effective
focal length of the slit camera. See text for further details.



3 Derivation of an Epipolar Curve Equation 



In this section, we elaborate the derivation of a general epipolar curve equation for a pair
of polycentric panoramas. Various coordinate systems are defined beforehand to help
clarify the geometrical transformations used in the derivation.



3.1 Coordinate Systems 

We define a 1D discrete image coordinate system for each slit image with the coordinate
denoted by v. The unit of this coordinate system is defined in terms of an image pixel.
We define another 1D real-number image-slit coordinate system with the coordinate
denoted by y. The origins of the image and the image-slit coordinate systems are at the
top and the center of the image slit, respectively. Let $v_c$ be the principal point² in
discrete image space. The conversion between these two coordinates v and y is
$y = d(v - v_c)$, where d is the size of an image pixel in units of y. The image
coordinate systems are shown in Fig. 1.

We define a 2D discrete image coordinate system for each polycentric panoramic
image. The coordinates are denoted as (u, v), which is an image pixel at column u
and row v. Each column itself is a slit image, thus the coordinate v here is identical
to the coordinate v in the slit image coordinate system. We define another 2D
real-number image-surface coordinate system for each polycentric panoramic image with
the coordinates (x, y). The origin of this coordinate system is defined at the center of
the initial image slit. The conversion between these two coordinates (u, v) and (x, y) is
$x = du$ and $y = d(v - v_c)$, where d is the size of an image pixel in units of y.
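The two conversions between discrete pixel coordinates and real-valued image-surface coordinates can be written down directly. This is a minimal sketch of the stated relations $x = du$ and $y = d(v - v_c)$; the function names are illustrative, and the numbers in the usage note are arbitrary.

```python
import math

def pixel_to_surface(u, v, d, v_c):
    # (u, v) in pixels -> (x, y) in the image-surface system,
    # with pixel size d and principal point row v_c.
    return d * u, d * (v - v_c)

def surface_to_pixel(x, y, d, v_c):
    # Inverse mapping: (x, y) -> (u, v), generally non-integer.
    return x / d, v_c + y / d
```

For example, with a pixel size of 1/6 and a principal point row of 540, `pixel_to_surface(12, 100, 1/6, 540)` maps column 12 to x = 2 and row 100 to a negative y, since that row lies above the center of the slit.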

² The center pixel of the slit image, where the optical axis of the slit camera passes
through the image.



42 



F. Fluang, S.K. Wei, and R. Klette 



A 3D slit-camera coordinate system, shown in Fig. 1, is defined as follows. The
origin coincides with the focal point of a slit camera, denoted as $O_c$. The z-axis is
perpendicular to the image slit and passes through the center of the image slit. The
y-axis is parallel to the image slit and points in the direction of positive y values
in the image-slit coordinate system. An image point p on the image slit can be
represented by the coordinates (0, y, f), where f is the effective focal length of the
slit camera. Another way of representing an image point p is by an angular coordinate
φ, which is the angle between the z-axis and the line passing through both the focal
point and the image point. The conversion between the coordinates (0, y, f) and φ is
$\phi = \tan^{-1}(y/f)$.

Each column of a polycentric panoramic image is associated with a slit-camera
coordinate system. All the origins of the slit-camera coordinate systems lie on the
circle $\mathcal{C}$. A 3D turning-rig³ coordinate system is defined for each polycentric
panoramic image. The origin, denoted as $O_o$, coincides with the center of the circle
$\mathcal{C}$. The z-axis passes through the center of the initial column of the panoramic
image. The y-axis is parallel to all the slit images and points in the same direction
as the y-axis of the slit-camera coordinate system. We define an angle θ to be the
angle between the z-axis of the turning-rig coordinate system and the segment
$O_o C$. The orientation and the location of a slit-camera coordinate system with
respect to the turning-rig coordinate system can be described by a 3 × 3 rotation
matrix $R_{oc}$,

$$R_{oc} = \begin{pmatrix} \cos(\theta + \omega) & 0 & -\sin(\theta + \omega) \\ 0 & 1 & 0 \\ \sin(\theta + \omega) & 0 & \cos(\theta + \omega) \end{pmatrix},$$

where ω is the angle between the normal vector of the circle $\mathcal{C}$ at $O_c$ and the
optical axis of the slit camera, and a 3 × 1 translation vector

$$T_{oc} = \begin{pmatrix} r \sin\theta \\ 0 \\ r \cos\theta \end{pmatrix},$$

where r is the radius of the circle $\mathcal{C}$. Figure 2 depicts the relationship between
the slit-camera coordinate systems and the turning-rig coordinate system. The
conversion between the coordinate u and the angle θ is $\theta = (2\pi u)/i$, where i
denotes the length (in pixels) of the panoramic image.
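As an illustration of how the rotation matrix and translation vector are used, the following sketch maps the viewing ray of an image-slit point into the turning-rig frame. It assumes that a point given in slit-camera coordinates is taken to the turning-rig frame by applying the inverse (transpose) of the rotation and then adding the translation; the function names are hypothetical.

```python
import math

def rotation_oc(theta, omega):
    # 3x3 rotation about the y-axis by the angle theta + omega.
    a = theta + omega
    return [[math.cos(a), 0.0, -math.sin(a)],
            [0.0, 1.0, 0.0],
            [math.sin(a), 0.0, math.cos(a)]]

def translation_oc(r, theta):
    # Focal point position on the base circle of radius r.
    return [r * math.sin(theta), 0.0, r * math.cos(theta)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def ray_in_rig(y, f, r, theta, omega):
    # Ray through image-slit point (0, y, f): returns (origin, unit direction)
    # expressed in the turning-rig coordinate system.
    phi = math.atan2(y, f)
    d_cam = [0.0, math.sin(phi), math.cos(phi)]
    d_rig = mat_vec(transpose(rotation_oc(theta, omega)), d_cam)
    return translation_oc(r, theta), d_rig
```

Under this assumption the direction returned by `ray_in_rig` has the components (sin(θ+ω) cos φ, sin φ, cos(θ+ω) cos φ), and the ray starts at the focal point on the base circle.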

A 3D world coordinate system is defined for the conversion between any pair of
turning-rig coordinate systems for two polycentric panoramic images. The origin is
denoted as $O_w$. The relationship between the world coordinate system and a turning-rig
coordinate system associated with a panoramic image can be described by a 3 × 3 rotation
matrix $R_{wo}$ and a 3D translation vector $T_{wo}$.



3.2 Derivation 

Given an image point of a polycentric panoramic image $\mathcal{P}$ (the source image),
the task is to calculate the epipolar curve in another panoramic image $\mathcal{P}'$ (the
destination image). Every symbol associated with the destination panoramic image
carries a prime (') to distinguish it from the corresponding symbol of the source
panoramic image.

³ For example, a turntable or a turning head on a tripod.

Fig. 2. The geometrical relationship among a slit camera coordinate system (with origin at $O_c$),
an associated turning-rig coordinate system (with origin at $O_o$), and the world coordinate system
(with origin at $O_w$). See text for further details.

If an image point p of $\mathcal{P}$ is given, then a 3D projection ray, denoted as
$\mathcal{L}_c$, emitting from a focal point C through the point p is defined. The
projection ray $\mathcal{L}_c$ with respect to the slit-camera coordinate system can be
described by $P + \lambda D$, where P is a zero 3-vector, $\lambda \in \Re$ is any scalar,
and D is a unit directional vector:

$$D = \left(0,\ \frac{y}{\sqrt{f^2 + y^2}},\ \frac{f}{\sqrt{f^2 + y^2}}\right)^T = (0, \sin\phi, \cos\phi)^T.$$

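The projection ray $\mathcal{L}_c$ is first transformed to the turning-rig coordinate system
of the panoramic image $\mathcal{P}$. The resulting ray is denoted as $\mathcal{L}_o$. The
transformation formula is as follows:

$$\mathcal{L}_o = R_{oc}^{-1} P + T_{oc} + \lambda R_{oc}^{-1} D = T_{oc} + \lambda R_{oc}^{-1} D.$$

The projection ray $\mathcal{L}_o$ is then transformed to the turning-rig coordinate system
of the destination panoramic image $\mathcal{P}'$ through the world coordinate system. The
resulting ray is denoted as $\mathcal{L}_{o'}$. The transformation formula is as follows:

$$\begin{aligned}
\mathcal{L}_{o'} &= R_{wo'}\left(R_{wo}^{-1}\mathcal{L}_o + T_{wo} - T_{wo'}\right) \\
 &= R_{wo'}R_{wo}^{-1}T_{oc} + R_{wo'}\left(T_{wo} - T_{wo'}\right) + \lambda R_{wo'}R_{wo}^{-1}R_{oc}^{-1}D \\
 &= R_{wo'}R_{wo}^{-1}\begin{pmatrix} r\sin\theta \\ 0 \\ r\cos\theta \end{pmatrix} + R_{wo'}\left(T_{wo} - T_{wo'}\right) + \lambda R_{wo'}R_{wo}^{-1}\begin{pmatrix} \sin(\theta+\omega)\cos\phi \\ \sin\phi \\ \cos(\theta+\omega)\cos\phi \end{pmatrix}. \qquad (1)
\end{aligned}$$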






The epipolar curve equation is an equation in terms of x' and y', which are the
image-surface coordinates of the destination panoramic image $\mathcal{P}'$. Every point
(x', y') is the projection of some 3D point on the ray $\mathcal{L}_{o'}$. In other words,
every (x', y') is possibly the corresponding point of the given image point (u, v) of
$\mathcal{P}$. Let $\mathcal{I}'$ denote the slit image contributing to the column x' of
$\mathcal{P}'$ and let C' denote the associated slit camera’s focal point. For each column
x', the corresponding y' value can be found by the following two steps. First, calculate
the intersection point, denoted as Q', of the ray $\mathcal{L}_{o'}$ and the plane
$\mathbf{P}_{o'}$ passing through C' and $\mathcal{I}'$. Second, project point Q' onto the
slit image $\mathcal{I}'$ to obtain the value of y'.

The associated angle θ' is $(2\pi x')/i'$, where i' is the length of the destination
panoramic image. The position of the focal point C' with respect to the turning-rig
coordinate system of $\mathcal{P}'$ can be described by (r' sin θ', 0, r' cos θ'), where r'
is the radius of the circle $\mathcal{C}'$. A unit vector perpendicular to the plane
$\mathbf{P}_{o'}$ is (-cos(θ' + ω'), 0, sin(θ' + ω')), where ω' is the angle between the
normal vector of $\mathcal{C}'$ at C' and plane $\mathbf{P}_{o'}$. Therefore, the equation of
plane $\mathbf{P}_{o'}$ is

$$-\cos(\theta' + \omega')\,x + \sin(\theta' + \omega')\,z = r' \sin\omega', \qquad (2)$$

where the variables x and z are with respect to the turning-rig coordinate system of the
destination panoramic image $\mathcal{P}'$.

We substitute the x- and z-components of the projection ray $\mathcal{L}_{o'}$ in Equ. (1)
into the plane equation, Equ. (2), and solve for the value of λ. The intersection point
Q' can then be calculated from Equ. (1). We denote the obtained coordinates of Q' as
$(x_{o'}, y_{o'}, z_{o'})$. We have

$$\begin{pmatrix} x_{c'} \\ y_{c'} \\ z_{c'} \end{pmatrix} = \begin{pmatrix} x_{o'}\cos(\theta' + \omega') - z_{o'}\sin(\theta' + \omega') + r'\sin\omega' \\ y_{o'} \\ x_{o'}\sin(\theta' + \omega') + z_{o'}\cos(\theta' + \omega') - r'\cos\omega' \end{pmatrix},$$

which transforms the point Q' to the slit-camera coordinate system associated with the
slit image $\mathcal{I}'$; we denote the result as $(x_{c'}, y_{c'}, z_{c'})$.

A 3D point is allowed to project onto the slit image if and only if the x-component
of its coordinates with respect to the slit-camera coordinate system is equal to zero.
Therefore, the projection of a 3D point $(0, y_{c'}, z_{c'})$ on the slit image
$\mathcal{I}'$ is

$$\left(0,\ \frac{f'\,y_{o'}}{x_{o'}\sin(\theta' + \omega') + z_{o'}\cos(\theta' + \omega') - r'\cos\omega'}\right)^T,$$

where f' is the effective focal length of the slit camera acquiring $\mathcal{P}'$. Convert
the projection of a 3D point in the slit image $\mathcal{I}'$ back to the image-surface
coordinate system of the panoramic image $\mathcal{P}'$. Given x', the value of y' is

$$y' = \frac{f'\,y_{o'}}{x_{o'}\sin(\theta' + \omega') + z_{o'}\cos(\theta' + \omega') - r'\cos\omega'}.$$

To draw an epipolar curve in a discrete image, the coordinates (x', y') are converted to
the discrete image coordinate system (u', v') by $u' = x'/d'$ and $v' = v'_c + y'/d'$,
where d' is the size of an image pixel in units of y'.

Fig. 3. An example of epipolar curves in a pair of horizontally-aligned polycentric panoramic
images.


4 Epipolar Curves in Special Cases 

We discuss the general equation in the context of a few special cases of panoramic 
images. 



4.1 Epipolar Curve in Horizontally-Aligned Polycentric Panoramas 

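Consider two polycentric panoramic images, V and V'. The orientations and positions
of their turning-rig coordinate systems with respect to the world coordinate system are
$R_{wo} = R_{wo'} = I_{3\times3}$, and $T_{wo} = (0,0,0)^T$ and $T_{wo'} = (t_x, 0, t_z)^T$, respectively. These
two panoramic images are called horizontally-aligned polycentric panoramas. Given
an image point $(x, y)$ on V, the equation of the epipolar curve on V' is

$$y' = y\,\frac{f'}{f}\cdot
\frac{r'\sin\omega' - r\sin\!\left(\frac{x'}{f'} - \frac{x}{f} + \omega'\right) - t_x\cos\!\left(\frac{x'}{f'} + \omega'\right) + t_z\sin\!\left(\frac{x'}{f'} + \omega'\right)}
{-r\sin\omega - r'\sin\!\left(\frac{x'}{f'} - \frac{x}{f} - \omega\right) - t_x\cos\!\left(\frac{x}{f} + \omega\right) + t_z\sin\!\left(\frac{x}{f} + \omega\right)}.$$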

Figure 3 shows an example of a pair of horizontally-aligned polycentric panoramas in
a synthetic 3D scene: a square room containing different objects, such as a sphere, a
box, and a knot, with mapped real images. The upper image shows the source panoramic
image, V, with 30 test points in labeled and enumerated positions. The lower image
shows the destination panoramic image, V', with the corresponding epipolar curves. The
turning-rig coordinate system associated with the upper panoramic image is set to the world
coordinate system. The lower panoramic image V' was acquired at the same height¹,
2 m to the east and 1 m to the north of the upper panoramic image V. The orientations
of these two panoramic images are identical. The effective focal lengths of the
slit cameras used for acquiring these two panoramic images are both equal to 35.704
mm. The radii of the circles on which the slit cameras' focal points lie are both equal to
40 mm. The orientations of the slit cameras with respect to their rotation axes are both
equal to 45°. Each slit camera takes 1080 slit images for one panoramic image. The
width and height of an image pixel are both equal to 1/6 mm.

Fig. 4. An example of epipolar curve in a pair of concentric panoramic images.

4.2 Epipolar Curve in Concentric Panoramas 

A set of polycentric panoramic images is called concentric panoramas if
the associated turning-rig coordinate systems all coincide. Consider two concentric
panoramic images, V and V'. Given an image point $(x, y)$ on V, the equation of the
epipolar curve on V' is

$$y' = y\,\frac{f'}{f}\cdot
\frac{r'\sin\omega' - r\sin\!\left(\frac{x'}{f'} - \frac{x}{f} + \omega'\right)}
{-r\sin\omega - r'\sin\!\left(\frac{x'}{f'} - \frac{x}{f} - \omega\right)}. \qquad (3)$$



Figure 4 shows an example of the epipolar curve in a pair of concentric panoramas. The
effective focal lengths of the slit cameras are both equal to 35.704 mm. The radii of the
circles are both equal to 40 mm. The orientation of the slit camera with respect to the
rotation axis is equal to 10° for the upper panoramic image and to 300° for the lower
image.

¹ It follows that the y-components of the world coordinates of the associated rotation centers are
equal.

Fig. 5. An example of the epipolar curves in a symmetric pair of concentric panoramas.

In particular, if $\omega' = 2\pi - \omega$, the two concentric panoramic images are called a
symmetric pair. An important property of the symmetric panoramic image
pair is that the epipolar curves become straight lines and coincide with image rows. The
property can be shown from Equ. 3 by setting $f = f'$, $r = r'$, and, most
critically, $\omega' = 2\pi - \omega$:

$$\begin{aligned}
y' &= y\,\frac{r\sin(2\pi - \omega) - r\sin\!\left(\frac{x' - x}{f} + 2\pi - \omega\right)}
{-r\sin\omega - r\sin\!\left(\frac{x' - x}{f} - \omega\right)} \\
&= y\,\frac{-\sin\omega - \sin\!\left(\frac{x' - x}{f} - \omega\right)}
{-\sin\omega - \sin\!\left(\frac{x' - x}{f} - \omega\right)} \\
&= y.
\end{aligned}$$

The value of $y'$ is equal to $y$. Figure 5 shows an example of the epipolar lines in a
symmetric pair of concentric panoramas. The parameters are the same as in the previous
settings; only the orientations of the slit cameras differ. One is equal to 10° and
the other is equal to 300°. Note that all the epipolar curves become straight lines and
coincide with image rows.
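The symmetric-pair property is easy to check numerically. The following sketch evaluates the concentric epipolar curve of Equ. 3 for a given source point; the function name and the guard against a vanishing denominator are assumptions introduced for illustration, with angles in radians and $\theta = x/f$, $\theta' = x'/f'$.

```python
import math

def epipolar_y_concentric(x, y, x_p, f, f_p, r, r_p, omega, omega_p):
    """Evaluate y' on the destination panorama for column x_p (Equ. 3).

    Returns None where the denominator vanishes (for example at x_p == x
    in a symmetric pair), since the curve is undefined there.
    """
    d = x_p / f_p - x / f
    num = r_p * math.sin(omega_p) - r * math.sin(d + omega_p)
    den = -r * math.sin(omega) - r_p * math.sin(d - omega)
    if abs(den) < 1e-12:
        return None
    return y * (f_p / f) * num / den

# Trace the curve of source point (10, 5) for a symmetric pair:
omega = math.radians(10)
curve = [epipolar_y_concentric(10.0, 5.0, x_p, 35.704, 35.704, 40.0, 40.0,
                               omega, 2 * math.pi - omega)
         for x_p in (20.0, 50.0, 80.0)]
```

With $\omega' = 2\pi - \omega$, $f = f'$ and $r = r'$, every entry of `curve` equals the source row y up to rounding, which is the straight-line behaviour derived above.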
4.3 Epipolar Curve in Single-Center Panoramas

A set of polycentric panoramic images acquired with all the slit cameras' focal points
coinciding at a single point is called single-center panoramas. Consider two single-
center panoramic images, V and V'. The orientations and positions of their turning-rig
coordinate systems with respect to the world coordinate system are
$R_{wo} = R_{wo'} = I_{3\times3}$, and $T_{wo} = (0,0,0)^T$ and $T_{wo'} = (t_x, t_y, t_z)^T$, respectively.
Each associated circle becomes a single point, so we have $r = r' = 0$ and $\omega = \omega' = 0$. Given an
image point $(x, y)$ on V, the equation of the epipolar curve in V' is



is a scalar. Figure 6 shows an example of the epipolar curves in a pair of single-center
panoramas. The parameters of these two panoramic image acquisitions are identical to
those of the polycentric panoramic pair, except that the orientations of the slit cameras are
both equal to 0° and the radii of the circles are both equal to 0 mm.

5 Conclusion and Further Work 

This paper proposes polycentric panoramas as a new model of panoramic images. We
have shown that this model is able to describe a wide range of existing panoramic
images. Hence, the derived epipolar curve equation is applicable to those more specific
types of panoramic images. So far, only the epipolar curve equation itself has been derived;
no mathematical analysis has been done. Since there are many parameters involved in the
equation, it is interesting to see how each of them affects the behavior of the epipolar
curve. How can the epipolar curves be classified based on their properties? How many
equivalence classes can be found? Given a set of uncalibrated panoramic images of one
particular class, how many corresponding points are necessary to calibrate the desired
parameters? In this paper, the panoramic image surface is chosen to be a perfect cylinder.
However, other geometric forms, such as an ellipse, exist for use in some
applications. It is interesting to derive a more general epipolar curve equation for those
panoramic images.
Abstract. We combine uncalibrated Structure-from-Motion, lightfield
rendering and view-dependent texture mapping to model and render
scenes from a set of images that are acquired from an uncalibrated hand-held
video camera. The camera is simply moved by hand around the
3D scene of interest. The intrinsic camera parameters like focal length
and the camera positions are automatically calibrated with a
Structure-From-Motion approach. Dense and accurate depth maps for each
camera viewpoint are computed with multi-viewpoint stereoscopic matching.
The set of images, their calibration parameters and the depth maps are
then utilized for depth-compensated image-based rendering. The
rendering utilizes a scalable geometric approximation that is tailored to the
needs of the rendering hardware.



1 Introduction 

This contribution discusses realistic scene reconstruction and visualization from
real image streams that are recorded by an uncalibrated, freely moving hand-held
camera. This approach makes it easy to acquire 3D scene models from real-world
scenes with high fidelity and minimum effort on equipment and calibration.

Recently, several approaches to this problem have been investigated.
Plenoptic modeling, lightfield rendering, and the lumigraph have
received a lot of attention, since they can capture the appearance of a 3D scene
from images only, without the explicit use of 3D geometry. Thus one may be
able to capture objects with very complex geometry that cannot be modeled
otherwise. Basically one caches views from many different directions all around
the scene and interpolates new views from this large image collection. For realistic
rendering, however, very many views are needed to avoid interpolation errors for
in-between views.

Structure from motion (SFM) approaches, on the other hand, try to
model the 3D scene and the camera motion geometrically and capture scene
details on polygonal (triangular) surface meshes. A limited set of camera views
of the scene is sufficient to reconstruct the 3D scene. Texture mapping adds
the necessary fidelity for photo-realistic rendering of the object surface. Dense
and accurate 3D depth estimates are needed for realistic image rendering from
the textured 3D surface model. Deviation from the true 3D surface will distort
the rendered images.

* Work was performed during stay at the Laboratory for Processing of Speech and
Images, PSI-ESAT, K.U. Leuven, Belgium

The problem common to both approaches is the need to calibrate the image
sequence. Recently it was proposed to combine a structure from motion
approach with plenoptic modeling to generate lightfields from uncalibrated
hand-held camera sequences. When generating lightfields from a hand-held camera
sequence, one typically generates images with a specific distribution of the
camera viewpoints. Since we want to capture the appearance of the object from all
sides, we will sample the viewing sphere, thus generating a mesh of viewpoints.
To fully exploit hand-held sequences, we will therefore have to deviate from
the regular lightfield data structure and adopt a more flexible rendering data
structure based on the viewpoint mesh. Another important point in combining
SFM and lightfield rendering is the use of scene geometry for image
interpolation. The geometric reconstruction yields a geometric approximation of the
real scene structure that might be insufficient when static texture mapping is
used. However, view-dependent texture mapping will adapt the texture
dynamically to a static, approximate 3D geometry.

In this contribution we will discuss the combination of Structure-from- 
Motion, lightfield rendering, and dynamic surface texturing. SFM delivers cam- 
era calibration and dense depth maps that approximate the scene geometry. 
Rendering is then performed by depth-compensated image interpolation from a 
mesh of camera viewpoints as generated by SFM. The novel image-based ren- 
dering method takes advantage of the irregular viewpoint mesh generated from 
hand-held image acquisition. We will first give a brief overview of the calibra- 
tion and reconstruction techniques by SFM. We will then focus on the depth- 
compensated image interpolation and show that only a coarse geometric approx- 
imation is necessary to guide the rendering process. Experiments on calibration, 
geometric approximation and image-based rendering verify the approach. 



2 Calibration and 3D-Reconstruction 

Uncalibrated Structure From Motion (SFM) is used to recover camera
calibration and scene geometry from images of the scene alone, without the need for
further scene or camera information. Faugeras and Hartley first demonstrated
how to obtain uncalibrated projective reconstructions from image point matches
alone. Beardsley et al. proposed a scheme to obtain projective
calibration and 3D structure by robustly tracking salient feature points throughout an
image sequence. This sparse object representation outlines the object shape, but
does not give sufficient surface detail for visual reconstruction. Highly realistic
3D surface models need a dense depth reconstruction and cannot rely on a few
feature points alone.

The method of Beardsley et al. was subsequently extended in two directions. On the
one hand, the projective reconstruction was updated to metric even for varying
internal camera parameters; on the other hand, a dense stereo matching technique
was applied between two selected images of the sequence to obtain a dense
depth map for a single viewpoint. From this depth map a triangular surface
wire-frame was constructed, and texture mapping from one image was applied
to obtain realistic surface models. The approach was later extended to
multi-viewpoint depth analysis. The approach can be summarized in 3 steps:

— Camera self-calibration and metric structure is obtained by robust tracking 
of salient feature points over the image sequence, 

— dense correspondence maps are computed between adjacent image pairs of 
the sequence, 

— all correspondence maps are linked together by multiple view point linking 
to fuse depth measurements over the sequence. 



2.1 Calibration of a Mesh of Viewpoints 

When very long image sequences have to be processed with the above described 
approach, there is a risk of calibration failure due to several factors. For one, the 
calibration as described above is built sequentially by adding one view at a time. 
This may result in accumulation errors that introduce a bias to the calibration. 
Secondly, if a single image in the sequence is not matched, the complete cali- 
bration fails. Finally, sequential calibration does not exploit the specific image 
acquisition structure used in this approach to sample the viewing sphere. 

A multi-viewpoint calibration algorithm has been described in earlier work that makes it
possible to weave the viewpoint sequence into a connected viewpoint mesh. This
approach is summarized in the following section.



Image pair matching. The basic tool for viewpoint calibration is the two-view
matcher. Corresponding image features $m_i$, $m_k$ have to be matched between the
two images of the camera viewpoints $P_i$, $P_k$. The image features are projections
of a 3D feature point $M$ into the images $I_i$, $I_k$ in homogeneous coordinates:

$$m_i = \rho_i P_i M, \quad m_k = \rho_k P_k M, \quad P = K[R^T \mid -R^T c] \qquad (1)$$

with $\rho$ a non-zero scaling factor, $K$ the camera calibration matrix, $R$ the
orientation and $c$ the position of the camera. To solve for $P$ from $m_i$, $m_k$ we employ a
robust computation of the fundamental matrix $F_{ik}$ with the RANSAC
(RANdom SAmple Consensus) method. Between all image correspondences the
fundamental image relation (the epipolar constraint) holds:

$$m_i^T F_{ik}\, m_k = 0 \qquad (2)$$

$F_{ik}$ is a $3 \times 3$ matrix of rank 2. A minimum set of 7 feature
correspondences is picked from a large list of potential image matches to compute a
specific $F$. For this particular $F$ the support is computed from the other
potential matches. This procedure is repeated randomly to obtain the most likely
$F_{ik}$ with best support in feature correspondence. From $F$ we can compute the
$3 \times 4$ camera projection matrices $P_i$ and $P_k$. The fundamental matrix alone does
not suffice to fully compute the projection matrices. In a bootstrap step for the
first two images we follow the approach by Beardsley et al. Since the camera
calibration matrix $K$ is unknown a priori, we assume an approximate $K$ to start
with. The first camera is then set to $P_0 = K[I \mid 0]$ to coincide with the world
coordinate system, and the second camera $P_1$ can be derived from the epipole $e$
(projection of the camera center into the other image) and $F$ as



$$P_1 = K\left[\,[e]_\times F + e\,a^T \mid \rho\,e\,\right], \qquad
[e]_\times = \begin{pmatrix} 0 & -e_3 & e_2 \\ e_3 & 0 & -e_1 \\ -e_2 & e_1 & 0 \end{pmatrix} \qquad (3)$$

$P_1$ is defined up to the global scale $\rho$ and the unknown plane $\pi_{\inf}$, encoded in $a^T$.
Thus we can only obtain a projective reconstruction. The vector $a^T$
should be chosen such that the left $3 \times 3$ matrix of $P_1$ best approximates an
orthonormal rotation matrix. The scale $\rho$ is set such that the baseline length
between the first two cameras is unity. $K$ and $a^T$ will be determined during
camera self-calibration.
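To make the structure of Equ. 3 concrete, the sketch below builds the cross-product matrix $[e]_\times$ and assembles a $3 \times 4$ second camera from given K, F, e, a and $\rho$. It only illustrates the matrix bookkeeping; the helper names are hypothetical, and the choice of a and $\rho$ discussed above is not implemented here.

```python
def skew(e):
    """Cross-product matrix [e]_x, so that skew(e) applied to v equals e x v."""
    e1, e2, e3 = e
    return [[0.0, -e3, e2],
            [e3, 0.0, -e1],
            [-e2, e1, 0.0]]

def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def second_camera(K, F, e, a, rho):
    """Assemble P_1 = K [ [e]_x F + e a^T | rho e ] as a 3x4 nested list."""
    ExF = matmul(skew(e), F)
    M = [[ExF[i][j] + e[i] * a[j] for j in range(3)] for i in range(3)]
    M_aug = [M[i] + [rho * e[i]] for i in range(3)]  # append the 4th column
    return matmul(K, M_aug)
```

For example, `matmul(skew([1.0, 2.0, 3.0]), [[4.0], [5.0], [6.0]])` returns the column vector of the cross product (1, 2, 3) x (4, 5, 6).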

Once we have obtained the projection matrices we can triangulate the
corresponding image features $m_i$, $m_k$ with $P_i$, $P_k$ to obtain the corresponding 3D
object features $M$. The object points are determined such that their reprojection
error in the images is minimized. In addition we compute the point uncertainty
covariance to keep track of measurement uncertainties. The 3D object points
serve as the memory for consistent camera tracking, and it is desirable to track
the projection of the 3D points through as many images as possible. This
process is repeated by adding new viewpoints and correspondences throughout the
sequence. Finally, constraints are applied to the cameras to obtain a metric
reconstruction. A detailed account of this approach can be found in the literature.



Estimating the viewpoint topology. Since we are collecting a large amount 
of images from all possible viewpoints distributed over the viewing sphere, it 
is no longer reasonable to consider a sequential processing along the sequence 
frame index alone. Instead we would like to evaluate the image collection in order 
to robustly establish image relationships between all nearby images. We need to
define a distance measure that allows us to estimate the proximity of two viewpoints
from image matches alone. We are interested in finding those camera viewpoints
that are near to the current viewpoint and that support calibration. Obvious 
candidates for these are the preceding and following frames in a sequence, but 
normally those viewpoints are taken more or less on a linear path due to camera 
motion. This near-linear motion may lead to degeneracies and problems in the 
calibration. We are therefore also interested in additional viewpoints that are 
perpendicular to the current direction of the camera motion. If the camera sweeps 
back and forth over the viewpoint surface we will likely approach the current 
viewpoint in previous and future frames. Our goal is now to determine which of 
all viewpoints are nearest and most evenly distributed around our current view. 
So far we do not know the positions of the cameras, but we can compute the
F-matrix from corresponding image points. For each potential neighbor image
we compute the F-matrix w.r.t. the current image. To measure proximity and direction
of the matched viewpoint w.r.t. the current one, we can exploit the image epipole
as well as the distribution of the correspondence vectors.

Direction: The epipole determines the angular direction $\alpha_e$ of the neighboring
camera position w.r.t. the current image coordinates, since it represents the
projection of the camera center into the current image. Those viewpoints whose
epipoles are most evenly distributed over all image quadrants should be selected
for calibration.

Proximity: The distribution of the corresponding matches determines the
distance between two viewpoints. Consider a non-planar scene and general motion
between both cameras. If both camera viewpoints coincide, we can cancel out the
camera orientation change between the views with a projective mapping
(rectification), and the corresponding points will coincide since no depth parallax
is involved. For a general position of the second camera viewpoint, the depth
parallax will cause a residual correspondence error $e_r$ after rectification that is
proportional to the baseline distance between the viewpoints. We can
approximate the projective rectification by a linear affine mapping that is estimated
from the image correspondences. We therefore define the residual
correspondence error $e_r$ after rectification as the proximity measure for nearby viewpoints.
The viewpoints with smallest $e_r$ are closest to the current viewpoint.
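One possible reading of this proximity measure is a least-squares affine fit between the matched points, followed by the root-mean-square residual. The sketch below is written under that assumption; the function names and the use of the RMS as the summary statistic are choices made here, and degenerate inputs (fewer than three matches, or collinear points) are not handled.

```python
import math

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            factor = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * 3
    for i in (2, 1, 0):
        x[i] = (M[i][3] - sum(M[i][j] * x[j] for j in range(i + 1, 3))) / M[i][i]
    return x

def affine_residual(src, dst):
    """Fit dst ~ A (x, y, 1)^T by least squares and return the RMS residual."""
    rows = [(x, y, 1.0) for x, y in src]
    # Normal equations: the same 3x3 matrix for both output coordinates.
    N = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for k in range(2):
        rhs = [sum(r[i] * d[k] for r, d in zip(rows, dst)) for i in range(3)]
        params.append(solve3(N, rhs))
    sq = 0.0
    for r, d in zip(rows, dst):
        for k in range(2):
            pred = sum(p * v for p, v in zip(params[k], r))
            sq += (pred - d[k]) ** 2
    return math.sqrt(sq / len(rows))
```

Matches that are related by a pure affine map give a residual near zero, while any parallax that the affine map cannot absorb raises it, so candidate neighbor views can be ranked by this value.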



Weaving the viewpoint mesh. With the distance measure at hand we can
build a topological network of viewpoints. We start with an arbitrary image of
the sequence and compute $\alpha_e$ and $e_r$ for subsequent images. If we choose the
starting image as the first image of the sequence, we can proceed along the frame
index and find the nearest adjacent viewpoints in all directions. From these seed
views we proceed recursively, building the viewpoint mesh topology over all
views. The mesh builds along the shortest camera distances, very much like a
wave propagating over the viewpoint surface.



2.2 3D Geometry Estimation 

Once we have retrieved the metric calibration of the cameras we can use image 
correspondence techniques to estimate scene depth. We rely on stereo matching 
techniques that were developed for dense and reliable matching between adjacent 
views. The small baseline paradigm suffices here since we use a rather dense 
sampling of viewpoints. 

For dense correspondence matching an area-based disparity estimator is
employed. The matcher searches at each pixel in one image for maximum
normalized cross correlation in the other image by shifting a small measurement window
(kernel size 7x7) along the corresponding epipolar line. Dynamic programming
is used to evaluate extended image neighborhood relationships, and a pyramidal
estimation scheme makes it possible to deal reliably with very large disparity ranges.
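The core of such an area-based matcher can be sketched as follows. The sketch assumes rectified images, so that the epipolar line of a pixel is simply the same image row, and it searches the match of left column `col` at right column `col - d`; both conventions, as well as the function names, are assumptions made for illustration. Dynamic programming and the pyramidal scheme are left out.

```python
import math

def ncc(a, b):
    """Normalized cross correlation of two equally long intensity lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    den = math.sqrt(sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b))
    return num / den if den > 0 else 0.0

def window(img, row, col, half):
    """Flatten the (2*half+1) x (2*half+1) window centered at (row, col)."""
    return [img[r][c]
            for r in range(row - half, row + half + 1)
            for c in range(col - half, col + half + 1)]

def best_disparity(left, right, row, col, half, max_disp):
    """Return (disparity, score) maximizing NCC along the image row."""
    ref = window(left, row, col, half)
    best_d, best_score = 0, -2.0
    for d in range(max_disp + 1):
        c = col - d
        if c - half < 0:  # window would leave the image
            break
        score = ncc(ref, window(right, row, c, half))
        if score > best_score:
            best_d, best_score = d, score
    return best_d, best_score
```

A 7x7 kernel as in the text corresponds to `half=3`; the caller is responsible for keeping the reference window inside the image.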
The geometry of the viewpoint mesh is especially suited for further
improvement with a multi-viewpoint refinement. Each viewpoint is matched with all
adjacent viewpoints and all corresponding matches are linked together to form
a reliable depth estimate. Since the different views are rather similar we will
observe every object point in many nearby images. This redundancy is exploited
to improve the depth estimation for each object point, and to refine the depth
values to high accuracy.

2.3 Experimental Results for Surface Mesh Calibration 

To evaluate our approach, we recorded a test sequence with known ground truth
from a calibrated robot arm. The camera is mounted on the arm of a robot of
type SCORBOT-ER VII. The position of its gripper arm is known from the
angles of the 5 axes and the dimensions of the arm. The robot sampled an 8 x 8
spherical viewing grid with a radius of 230 mm. The viewing positions enclosed a
maximum angle of 45 degrees, which gives an extension of the spherical viewpoint
surface patch of 180 x 180 mm². The scene (with a size of about 150 x 150 x 100 mm³)
consists of a cactus and some metallic parts on a piece of rough white wallpaper.




Fig. 1. Top left: one image of the robot sequence. Top middle: The distribution of the 
camera viewpoints over the 3D scene. Top right: sequential camera path as obtained 
from tracking along the camera path. Bottom: Intermediate steps of the mesh building 
after 4, 32, and 64 images. The camera viewpoints are indicated by pyramids that are 
connected by the viewpoint mesh. The black points in the background are tracked 3D 
feature points. One can see how the 2D mesh topology is building over the viewpoint 
surface. 
Table 1. Ground truth comparison of 3D camera positional error between the 64
estimated and the known robot positions [in % of the mean object distance of 250
mm].

Camera position        projective          similarity
Tracking Error [%]     mean     s.dev      mean     s.dev
sequential             1.08     0.69       2.31     1.08
2D viewpoints          0.57     0.37       1.41     0.61



One of the original images is shown in Fig. 1 (top left) together with the
distribution of the camera viewpoints of the robot arm (top middle). Each camera
position is visualized as a little pyramid. Fig. 1 (bottom) shows the results of calibration
using a viewpoint mesh. The mesh buildup is indicated by the estimated
camera viewpoints (pyramids) and their topological relation (mesh connecting
the cameras). Each connection indicates that the fundamental matrix between
the image pair has been computed.

A quantitative evaluation of the tracking was performed by comparing the
estimated metric camera pose with the known Euclidean robot positions. We
anticipate two types of errors: 1) a stochastic measurement noise on the camera
position, and 2) a systematic error due to a remaining projective skew from
imperfect self-calibration. We also compared the simple sequential calibration
that estimates $F_{i,k}$ along adjacent images of the recording path only (Fig. 1, top
right) with the novel 2D mesh calibration method (see Fig. 1, bottom).

For comparison we transform the measured metric camera positions into
the Euclidean robot coordinate frame. With a projective transformation we can
eliminate the skew and estimate the measurement error. We estimated the
projective transform from the 64 corresponding camera positions and computed the
residual distance error. The distance error was normalized to relative depth by
the mean surface distance of 250 mm. The mean residual error dropped from
1.1% for sequential tracking to 0.58% for viewpoint weaving (see Table 1). The
position repeatability error of the robot itself is 0.08%.

If we assume that no projective skew is present then a similarity transform 
will suffice to map the coordinate sets onto each other. A systematic skew how- 
ever will increase the residual error. To test for skew we estimated the similarity 
transform from the corresponding data sets and evaluated the residual error. 
Here the mean error increased to 1.4% for mesh tracking which is still good for 
pose and structure estimation from fully uncalibrated sequences. 



3 Plenoptic Modeling and Rendering 

After determining the pose and projection properties of the moving camera we 
want to use the calibrated cameras to create a scene model for visualization. 

One possible method is lightfield rendering. To create a lightfield model for
real scenes, a large number of views from many different angles are taken. Each
view can be considered as a bundle of light rays passing through the optical center
of the camera. The set of all views contains a discrete sampling of light rays
with corresponding color values, and hence we get discrete samples of the plenoptic
function. The light rays which are not represented have to be interpolated.

The original 4-D lightfield data structure uses a two-plane parameterization.
Each light ray passes through two parallel planes with plane coordinates (s, t)
and (u, v). Thus the ray is uniquely described by the 4-tuple (u, v, s, t). The
(s, t)-plane is the viewpoint plane in which all camera focal points are placed on
regular grid points. The projection parameters of each camera are constructed
such that the (u, v)-plane is their common image plane and that their optical
axes are perpendicular to it.
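As a small illustration of the two-plane parameterization, the sketch below intersects a ray with two parallel planes and returns the 4-tuple (u, v, s, t). Placing the planes at z = 0 and z = 1 is an arbitrary assumption made here for the example, as is the function name.

```python
def two_plane_coords(origin, direction, z_st=0.0, z_uv=1.0):
    """Intersect a ray with the planes z = z_st and z = z_uv.

    Returns the lightfield coordinates (u, v, s, t) of the ray.
    """
    ox, oy, oz = origin
    dx, dy, dz = direction
    if dz == 0:
        raise ValueError("ray is parallel to the parameterization planes")
    lam_st = (z_st - oz) / dz
    lam_uv = (z_uv - oz) / dz
    s, t = ox + lam_st * dx, oy + lam_st * dy
    u, v = ox + lam_uv * dx, oy + lam_uv * dy
    return u, v, s, t
```

A ray starting on the viewpoint plane at (1, 2, 0) with direction (0.5, -1, 1) has (s, t) = (1, 2) and meets the image plane at (u, v) = (1.5, 1).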

Often, real objects are supposed to be Lambertian, meaning that one sur- 
face point has the same radiance value viewed from all possible directions. This 
implies that two viewing rays have the same color value if they intersect the 
surface at the same point. If specular effects occur, this is not true any more. 
The radiance value will change with changing viewing direction, but for a small 
change of the viewing angle, the color value also will change just a little. Con- 
sequently, two viewing rays have similar color values, if their direction is similar 
and if their point of intersection is near the surface of the scene. 

To render a new view we assume a virtual camera pointing at the
scene. For each pixel we can determine the position of the corresponding virtual
viewing ray. The nearer a recorded ray is to this virtual ray, the greater is its
support to its color value. So the general task of rendering views from a collection
of images will be to determine those viewing rays which are nearest to the virtual
one and to interpolate between them depending on their proximity.

Linear interpolation between the viewpoints in (s, t) and (u, v) produces a
blurred image with ghosting artifacts. In reality we will always have to choose
between high density of stored viewing rays with high data volume and high
fidelity, or low density with poor image quality. If we know an approximation of
the scene geometry (see Fig. 2, left), the rendering result can be improved by an
appropriate depth-dependent warping of the nearest viewing rays.

Having a sequence of images taken with a hand-held camera, in general the
camera positions are not placed at the grid points of the viewpoint plane. A method
has been proposed for resampling this regular two-plane parameterization
from real images recorded from arbitrary positions (re-binning). The required
regular structure is re-sampled and gaps are filled by applying a multi-resolution
approach, considering depth corrections. The disadvantage of this re-binning
step is that the interpolated regular structure already contains inconsistencies
and ghosting artifacts due to errors in the coarsely approximated geometry. To
render views, a depth-corrected look-up is performed. During this step, the effect
of ghosting artifacts is repeated, so duplicate ghosting effects occur.



Image-Based Rendering from Uncalibrated Lightfields 






3.1 Representation with Recorded Images 

Our goal is to overcome the problems described in the last section by relaxing
the restrictions imposed by the regular lightfield structure and to render views
directly from the calibrated sequence of recorded images using local depth maps.
Without losing performance we directly map the original images onto a surface
viewed by a virtual camera.



2-D Mapping: In this paragraph, a formalism for mapping image coordinates
onto a plane $A$ is described. The following approaches will use this formalism
to map images onto planes and vice versa. We define a local coordinate system
on $A$ by giving one point $a_0$ on the plane and two vectors $a_1$ and $a_2$ spanning the
plane. So each point $p$ of the plane can be described by the coordinates $x_A$,
$y_A$: $p = (a_1, a_2, a_0)(x_A, y_A, 1)^T$. The point $p$ is perspectively projected into a
camera which is represented by the $3 \times 3$ matrix $Q = KR^T$ and the projection
center $c$ (same notation as above). Matrix $R$ is the orthonormal rotation matrix
and $K$ is an upper triangular calibration matrix. The resulting image coordinates
$x$, $y$ are determined by $\rho(x, y, 1)^T = Qp - Qc$. Inserting the above equation for $p$
results in

$$\rho \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = Q\,(a_1, a_2, a_0 - c) \begin{pmatrix} x_A \\ y_A \\ 1 \end{pmatrix}. \qquad (4)$$

The value $\rho$ is an unknown scale factor. Each mapping between a local plane
coordinate system and a camera can be described by a single $3 \times 3$ matrix
$B = Q(a_1, a_2, a_0 - c)$.

We can extend our mapping procedure to re-project the image of one camera
(with center $c_i$) onto the plane, followed by a projection into the other camera
(with center $c_V$). Then the whole mapping is performed by

$$(x_V, y_V, 1)^T = B_V B_i^{-1} (x_i, y_i, 1)^T. \qquad (5)$$

The $3 \times 3$ matrix $B_V B_i^{-1}$ describes the projective mapping from one camera to
another via a given plane. Figure 2 (right) shows this situation for two camera
positions $c_V$ and $c_i$.
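The mapping of Equ. 4 and Equ. 5 can be written down directly with small nested-list matrices. In the sketch below the function names are hypothetical, and the result of the matrix product is divided by its third component, since the homogeneous coordinates are only defined up to the scale factor $\rho$.

```python
def matmul3(A, B):
    """Product of two 3x3 nested-list matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def inverse3(M):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if det == 0:
        raise ValueError("singular mapping (e.g. camera center in the plane)")
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]

def plane_to_camera(Q, c, a0, a1, a2):
    """B = Q (a1, a2, a0 - c): maps plane coordinates to image coordinates."""
    cols = [a1, a2, [a0[k] - c[k] for k in range(3)]]
    A = [[cols[j][k] for j in range(3)] for k in range(3)]  # columns to matrix
    return matmul3(Q, A)

def map_point(B_v, B_i, x_i, y_i):
    """Map (x_i, y_i) of camera i into the virtual camera via the plane."""
    H = matmul3(B_v, inverse3(B_i))
    hx, hy, hw = (H[r][0] * x_i + H[r][1] * y_i + H[r][2] for r in range(3))
    return hx / hw, hy / hw
```

For two cameras with Q equal to the identity, centers (0, 0, 0) and (1, 0, 0), and the plane z = 2, the image point (0.5, 0.25) corresponds to the plane point (1, 0.5, 2) and maps to (0, 0.25) in the second camera.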



Mapping via global plane: We apply the previously described method of
mapping an image via a given plane to create a virtual scene view directly
from real ones. In a first approach, we approximate the scene geometry by a
single plane $A$. This may seem a crude approximation but, as mentioned before,
the lightfield approach makes exactly this assumption. In the simplest
approach, we follow this method, although in regions where the scene surface
differs much from the plane $A$ a blurring effect will be visible. In the next
section we will improve our approach for refined geometric scene descriptions.

Following the lightfield approach, we have to interpolate between neighboring
views to construct a specific virtual view. Considering the fact mentioned above
that the nearest rays give the best support to the color value of a given ray, we
conclude that those views give the most support to the color value of a particular
pixel whose projection center is closest to the viewing ray of this pixel. This is
equivalent to the fact that those real views give the most support to a specified
pixel of the virtual view whose projected camera centers are close to its image
coordinate. We restrict the support to the nearest three cameras (see Fig. 3).
To determine these three neighbors we project all camera centers into the virtual
image and perform a 2-D triangulation. Then the neighboring cameras of a pixel
are determined by the corners of the triangle to which this pixel belongs. The
texture of such a triangle, and consequently a part of the reconstructed image,
is drawn as a weighted sum of three textured triangles.

Fig. 2. Left: depth-dependent interpolation errors in the two-plane lightfield approach.
The new viewing ray r is interpolated by the weighted sum of $l_i$ and $l_{i+1}$ from the
adjacent cameras $s_i$ and $s_{i+1}$. Since the real surface geometry deviates from the planar
intersection point u at the focal plane, ghosting artifacts occur. Right: Projective
mapping from one camera into another via a plane.

These textures are extracted from the original views by directly mapping
the coordinates $x_i, y_i$ of image i into the virtual camera coordinates $x_V, y_V$
by applying equation (5).

To overlay these three textures, we calculate a weighted sum of the color
values. Each triangle is weighted with factor 1 at the corner belonging to the
projection center of the corresponding real view and with weight 0 at both others.
In between, the weights are interpolated linearly, similar to Gouraud shading,
where the weights describe a planar ramp in barycentric coordinates. Within
the triangle, the sum of the three weights is 1 at each point. The total image
is built as a mosaic of these triangles. Although this technique assumes a very
sparse approximation of geometry, the rendering results show only few ghosting
artifacts (see section 4) at those regions where the scene geometry differs much
from the approximating plane.

Mapping via local planes: The results can be further improved by considering
local depth maps. Spending more time on each view, we can calculate the
approximating plane of geometry for each triangle depending on the actual
view. As the approximation is not done for the whole scene but just for that part
of the image which is seen through the actual triangle, we do not need a consistent
3-D model but can use the (normally erroneous) local depth maps. The
depth values are given as functions $Z_i$ of the coordinates in the recorded images,
$Z_i((x_i, y_i, 1)^T)$. They describe the distance of a point perpendicular to the image
plane. Using this depth function, we calculate the 3-D coordinates of those scene
points which have the same 2-D image coordinates in the virtual view as the
projected camera centers of the real views. The 3-D point $p_i$ which corresponds
to the real camera i can be calculated as

$$p_i = Z_i(Q_i d_i)\, d_i + c_i , \qquad (6)$$

where $d_i = n(c_i - c_V)$. The function n scales the given 3-D vector such that its
third component equals one. We can interpret the points $p_i$ as the intersection
of the line through $c_V$ and $c_i$ with the scene geometry (see figure 3). The 3-D
coordinates of the triangle corners define a plane which we can use to apply the
same rendering technique as described above for one global plane.
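The linear weighting used when overlaying the three textures can be illustrated as follows. This is only a sketch under simple assumptions: the triangle corners are the projected camera centers in 2-D image coordinates, each camera contributes one colour sample for the pixel, and the function names and numbers are made up for the example.

```python
import numpy as np

def barycentric_weights(p, t0, t1, t2):
    """Weights of the three cameras for pixel p inside triangle (t0, t1, t2).
    Each weight is 1 at its own corner, 0 at the two others, and linear in between."""
    T = np.column_stack([t1 - t0, t2 - t0])
    w1, w2 = np.linalg.solve(T, p - t0)
    return np.array([1.0 - w1 - w2, w1, w2])

def blend(colors, weights):
    """Weighted sum of the colour values contributed by the three real views."""
    return weights @ colors

# Hypothetical projected camera centers (triangle corners) in the virtual image.
corners = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]

# Hypothetical colour of the same pixel as seen in each of the three views.
colors = np.array([[255.0, 0.0, 0.0],
                   [0.0, 255.0, 0.0],
                   [0.0, 0.0, 255.0]])

w_corner = barycentric_weights(corners[0], *corners)            # at camera 0
w_inside = barycentric_weights(np.array([1.0, 1.0]), *corners)  # interior pixel
pixel = blend(colors, w_inside)
```

In a hardware implementation the same ramp is obtained by assigning the vertex weights 1, 0, 0 to each textured triangle and letting Gouraud interpolation fill in the interior.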



Refinement: Finally, if the triangles exceed a given size, they can be subdivided
into four sub-triangles by splitting each of the three sides into two parts. We
determine the 3-D points corresponding to the midpoint of each side by applying
the same look-up method as used for radiance values to find the corresponding
depth value. After that, we reconstruct the 3-D point using equation (6) and
project it into the virtual camera, resulting in a point near the side of the triangle.
The point lies only in the neighborhood of the side, but this does not really matter;
it merely changes the triangulation structure slightly. One has to take care to
avoid inconsistencies caused by this look-up. For a given pair of neighboring
views, the look-up always has to be done in the same depth map. A simple
method is not to do this look-up per triangle, which would cause several look-ups
for the same triangle side, but to determine the 3-D points for each pair of
neighboring views in a preprocessing step. This also improves efficiency.

Fig. 3. Left: Drawing triangles of neighboring projected camera centers and
approximating scene geometry by one plane for the whole scene or for one camera
triple. Right: Refinement of triangulation by inserting new 3-D points corresponding
to the midpoints of the triangle sides.
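The consistency requirement can be met by keying the midpoint look-up on the unordered pair of views, so that a side shared by two triangles is resolved exactly once. The sketch below assumes views are identified by integer indices; `lookup_midpoint` is a hypothetical stand-in for the depth look-up and 3-D reconstruction described above and is always called with the smaller index first, so the same depth map is used for a given pair.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Triangle = Tuple[int, int, int]

def subdivide(triangles: List[Triangle], n_views: int,
              lookup_midpoint: Callable[[int, int], object]):
    """Split every triangle into four, creating one new vertex per unique side."""
    midpoint_id: Dict[FrozenSet[int], int] = {}
    midpoint_data: Dict[int, object] = {}

    def mid(a: int, b: int) -> int:
        key = frozenset((a, b))  # unordered pair of neighboring views
        if key not in midpoint_id:
            new_id = n_views + len(midpoint_id)
            midpoint_id[key] = new_id
            # Fixed argument order: the look-up always uses the same depth map.
            midpoint_data[new_id] = lookup_midpoint(min(a, b), max(a, b))
        return midpoint_id[key]

    refined: List[Triangle] = []
    for a, b, c in triangles:
        ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
        refined += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return refined, midpoint_data

# Demo with a stand-in look-up that only records how often it is called.
calls = []

def fake_lookup(lo: int, hi: int):
    calls.append((lo, hi))
    return ("midpoint", lo, hi)

# Two triangles over views 0..3 that share the side (1, 2).
refined, points = subdivide([(0, 1, 2), (1, 3, 2)], n_views=4,
                            lookup_midpoint=fake_lookup)
```

Calling `subdivide` again on its own output gives the next refinement level, with the previously created midpoints acting as ordinary vertices.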

For each of the sub-triangles created in this way, a separate approximating
plane is calculated in the above manner. Further subdivision can be done in
the same way to improve accuracy. This subdivision is particularly necessary if
only a few triangles contribute to a single virtual view. This hierarchical
refinement of the geometry can be performed adaptively depending on the required
accuracy and available computational resources, hence allowing easy scalability
with scene complexity.



3.2 Scalable Geometry for Interpolation 

The approach as discussed above can be used directly for scalable geometric scene 
approximation. The SFM reconstruction delivers local depth maps that contain 
the 3D scene geometry for a particular view point. However, sometimes the depth 
maps are sparse due to a lack of features. In that case, no dense geometric recon- 
struction is possible, but one can construct an approximate geometry (a mean 
global plane or a very coarse triangulated scene model) . This coarse model is not 
sufficient for geometric rendering but allows depth compensated interpolation. 
In fact, the viewpoint-adaptive light field interpolation combined with approximate
geometry amounts to viewpoint-dependent texture mapping. Since only
the nearby images are used for interpolation, the rendered image will be quite 
good even when only a very coarse geometry is used. The standard lightfield 
interpolation for example uses only a planar scene approximation that is not 
even adjusted to the mean scene geometry. Hence our approach is less distorting 
than standard lightfield rendering. The rendering will also be more realistic than 
standard texture mapping since we capture the reflectance characteristics of the 
scene. 

The adaptive refinement of the geometry (starting with a mean global planar
geometry and adapting to surface detail) can be used to control the amount of
geometry that we need for interpolation. For every viewpoint the scene is divided
into local planes until a given image quality (measured by image distortion) has
been reached. Alternatively, one can select a fixed level of geometric subdivision
based on the available rendering power of the texture mapping hardware.
For a given performance one can therefore guarantee that rendering is done in
constant time.

This geometric scalability is very useful in realtime environments where a
fixed frame rate is required, or in high-realism rendering where image quality
is paramount.
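Choosing a fixed subdivision level for a given hardware budget reduces to simple counting, since every refinement step replaces a triangle by four. The following sketch assumes that each mesh triangle is drawn three times (once per contributing view, as described for the weighted sum of three textured triangles) and that the hardware budget is expressed as textured triangles per frame. Both the budget figure and the function names are hypothetical.

```python
def triangles_at_level(base_triangles: int, level: int) -> int:
    """Number of mesh triangles after `level` subdivisions (factor 4 per level)."""
    return base_triangles * 4 ** level

def draws_per_frame(base_triangles: int, level: int) -> int:
    """Each triangle is rendered once for each of its three contributing views."""
    return 3 * triangles_at_level(base_triangles, level)

def max_level_for_budget(base_triangles: int, budget: int) -> int:
    """Largest subdivision level whose draw count still fits the per-frame budget."""
    if draws_per_frame(base_triangles, 0) > budget:
        raise ValueError("budget too small even for the unrefined mesh")
    level = 0
    while draws_per_frame(base_triangles, level + 1) <= budget:
        level += 1
    return level

# Hypothetical example: 300 triangles in the projected viewpoint mesh,
# hardware that sustains 60,000 textured triangles per frame.
level = max_level_for_budget(300, 60_000)
```

With the level fixed in this way the per-frame cost is bounded regardless of scene complexity, which is the constant-time property mentioned above.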

4 Experimental Results 

We also tested our approach with an uncalibrated hand-held sequence. A digital
consumer video camera (Sony DCR-TRV900 with progressive scan) was swept
freely over a cluttered scene on a desk, covering a viewing surface of about 1 m².
The resulting video stream was then digitized on an SGI O2 by simply grabbing
187 frames at more or less constant intervals. No care was taken to manually
stabilize the camera sweep.

Fig. 4. Top: Two images from the hand-held office sequence. Please note the changing
surface reflections in the scene. Middle: Camera tracking with viewpoint mesh (left)
and depth map from a specific viewpoint (right). Bottom: 3D surface model of the
scene rendered with shading (left) and texture (right).

Fig. 4 (top) shows two images of the sequence. Fig. 4 (middle, left) illustrates
the zigzag route of the hand movement as the camera scanned the scene. The
viewpoint mesh is irregular due to the arbitrary hand movements. The black dots
represent the reconstructed 3D scene points. From the calibrated sequence we
can compute any geometric or image based scene representation. As an example
we show in fig. 4 (bottom) a geometric surface model of the scene with local scene
geometry that was generated from the depth map (see fig. 4 middle, right).

Fig. 5 shows different refinement levels of the same geometry as viewed from
a particular camera viewpoint. Even with this very rough approximation, very
realistic view interpolation can be achieved.

Fig. 6 shows rendering of the same scene without depth compensation (left)
and with depth compensation using geometric refinement (mesh 1x subdivided,
middle). Even without geometry the rendering looks good, but due to the missing
depth compensation some ghosting artifacts occur. This is the result achievable
with the standard lightfield approach, but already exploiting the general
viewpoint mesh calibration. With geometry compensation the rendering is improved
substantially and the ghosting artifacts disappear. Note that we used only a very
coarse geometrical approximation, as displayed in fig. 5 (top right), but still
achieve high rendering quality.

The main advantage of lightfield rendering is that the rendering is local,
meaning that all depth and color information for a pixel is taken from the three
nearest camera images only. This allows changes in surface reflectivity when
viewing the scene from different angles. Some rendering results with surface
reflections are shown in fig. 6 (right). The same part of the scene is rendered
from different viewpoints, demonstrating that the reflections are preserved and
the images appear very natural.

Fig. 5. Viewpoint geometry for depth-compensated interpolation with different
levels of adaptive refinement (level of mesh subdivision: 0, 1, 2, 4; top left to
bottom right).

Fig. 6. Top: Novel views rendered from the viewpoint mesh without (left) and with
(middle) depth compensation. Only a coarse geometrical approximation is used (see
upper right image of fig. 5). Right: Two views of the scene rendered from different
viewpoints with changing reflections. The rendering quality is very high due to the
natural appearance of the reflections.

5 Conclusions 

We have presented a system for calibration, reconstruction, and plenoptic ren- 
dering of scenes from an uncalibrated hand-held video camera. The calibration 
exploits the proximity of viewpoints by building a viewpoint mesh that spans the 
viewing sphere around a scene. Once calibrated, the viewpoint mesh can be used 
for image-based rendering or 3D geometric modeling of the scene. The image- 
based rendering approach was discussed in detail and a new rendering approach 
was presented that renders directly from the calibrated viewpoint using
depth-compensated image interpolation. The level of geometric approximation is
scalable, which allows the rendering to be adapted to the given rendering hardware.
For the rendering, only standard planar projective mapping and Gouraud weighting
are employed, both of which are available in most rendering hardware.

Abstract. At the moment, fundamental changes in sensors, platforms,
and applications are taking place. Commercial digital camera systems on
airborne and spaceborne platforms are becoming available. This paper
describes current activities in sensor development and data processing.
The German Aerospace Centre (DLR) has been involved in digital cam- 
era experiments for the last 15 years. An example is the satellite sensor 
MOMS-02, which was successfully flown on the space shuttle D2 mission 
and later on PRIRODA, a module of the Russian space station MIR. 

In the last two years the DLR Institute for Space Sensor Technology and 
Planetary Exploration has been involved in a commercial camera devel- 
opment of the LH-Systems digital three line stereo sensor ADS40. The 
institute also delivers digital image data from flight campaigns with the 
digital camera HRSC-A for stereo processing together with the French 
company ISTAR. 



1 Introduction 

Today a lot of government agencies, private companies and science institutions 
make use of airborne and satellite remote sensing products for mapping re- 
sources, land use/land cover and for monitoring changing phenomena. Remote 
sensing, combined with geographic information system technologies, can produce 
information about current and future resource potentials. 

Up to now, mapping has normally relied on film-based aerial cameras and
traditional techniques. The digital workflow starts after film development with
the scanning of the film and image processing, including stereo visualisation. By
using different types of film, these cameras can also be used for remote sensing
to some extent.

Extensive research and industrial developments within the last 10 years in 
CCD technology, increasing computer performance and data storage capacity 
offer the opportunity to replace the film-based aerial camera for many applica- 
tions and also to improve the quality of the photogrammetric and remote sensing 
products. 

Digital systems are cost saving when used over a longer time period (no film,
no photo lab and better automation possibilities), product derivation is time
saving (no film development, no scanning and possible automation of the digital
workflow), and the images can be of higher quality (higher radiometric resolution
and accuracy, reproducible colour and in-flight image control).

Fig. 1. CCD-line and matrix imaging principle

Digital systems make new applications possible, because of new kinds of 
information (e.g. multispectral measurements), new products (multispectral in- 
formation together with digital elevation models) and digital image processing 
for multimedia applications. 



2 Imaging Principles for Airborne Digital Sensors 

An airborne digital sensor must provide a large field of view and swath width,
high radiometric and geometric resolution and accuracy, multispectral imagery
and stereo. This can be realised with area and line CCD arrays. The different
imaging principles are shown in figure 1.

Benefits of the matrix camera are a defined rigid geometry of the image, cen- 
tral perspective image geometry and a simple interfacing to an existing softcopy 
system. The main disadvantages are that available models have 4k • 7k pixels or 
less, matrices are extremely expensive, a shutter for matrix readout is necessary 
and the realisation of true colour needs additional cameras. 

The three-line (stereo) concept provides views forward from the aircraft,
vertically down, and backward. The imagery from each scan line provides
information about the objects on the ground from different viewing angles,
assembled into strips (figure 2). Attitude disturbances of the airborne platform
result in image distortions. With exact knowledge of the flight path, an image
and also a stereo reconstruction of the surface are possible.

The main advantages of the CCD-line camera are the best achievable ratio
between the number of pixels in the image and price, a simple realisation of
colour (RGB and NIR), and the absence of a shutter. The continuous data stream
allows simple (on-line) correction of defects and PRNU of the CCD as well as
simple data compression. The disadvantages are that accurate exterior orientation
is necessary for each measured line and some products are only available after
complete stereo processing.

Fig. 2. CCD-line and matrix imaging principle

3 Detectors for Digital Mapping Sensors 

In the last few months new detectors have entered the market and are now
available. Table 1 lists available CCD-matrices and Table 2 the CCD-lines.

Both large-format CCDs from Philips and Fairchild are only available as
experimental samples. They are extremely expensive and not available in larger
numbers. Most digital imaging systems are based on 4k × 4k and 4k × 7k
matrices.

Large-format CCD-matrices have fewer pixels along one image dimension
than CCD-lines. To meet the resolution/swath width requirements for airborne
digital cameras, multiple camera head solutions must be used.



Table 1. Large format CCD-matrices

Manufacturer   Model       Photopixel     Pixelsize [µm²]
Kodak          KAF 16800   4096 × 4096    9 × 9
Fairchild      -           4096 × 4096    15 × 15
Philips        FTF7040     7000 × 4000    12 × 12
Philips        -           9216 × 7168    12 × 12
Fairchild      -           9216 × 9216    8.75 × 8.75

Table 2. CCD-lines for high resolution scanner

Manufacturer   Model        Photopixel                Pixelsize [µm²]
Thomson        TH7834       12,000                    6.5 × 6.5
Thomson        customised   2 × 12,000 (staggered)    6.5 × 6.5
EEV            CCD21-40     12,288                    8 × 8
Kodak          KLI-10203    3 × 10,200 (true color)   7 × 7
Kodak          KLI-14403    3 × 14,204 (true color)   5 × 5
Fairchild      CCD194       12,000                    10 × 8.5
Sony           ILX734K      3 × 10,500 (true color)   8 × 8



4 High Resolution Airborne Sensors 



In this section examples of airborne matrix and line cameras are presented.

4.1 Airborne Matrix Camera

The following systems are examples or test systems.



IGN-Sensor ^ 

IGN (Institut Géographique National, Paris, France) has been testing its
system for several years. The panchromatic system is based on the 4k × 4k Kodak
matrix CCD. To improve the signal-to-noise ratio the system can be run in a
TDI (time delay and integration) mode.



Z/I DMC [2]

DMC (Digital Modular Camera) is the first commercial system which is
based on a matrix detector. The Z/I Imaging Corporation is a joint venture of
Intergraph and Carl Zeiss. This system was first announced at the Photogrammetric
Week, September 1999. A prototype was exhibited at the ISPRS 2000
conference. The camera description and data workflow can be found in [2].

To meet the resolution/swath width criterion, a four-camera-head solution
was established (figure 3). Four additional cameras on the same platform make
true colour and NIR images possible. The system is based on the Philips 4k × 7k
matrix. The ground coverage of this camera is shown in figure 3.

The Z/I approach needs additional calibration procedures and processing
to form one image from the four parts, with an image size of about 13k × 8k.
The camera can also use TDI to improve radiometric image quality. Attitude
disturbances (roll, pitch and yaw) can additionally influence the geometric image
quality in the TDI mode. Figure 4 shows an example image. The Siemens star
image gives an impression of image quality.

Fig. 3. DMC camera head (left) and ground coverage of the 4 camera-head system
(right)




Fig. 4. First example image of the DMC, flight height was 300m and GSD 5 cm 



4.2 Airborne Line Scanner 

Since the end of the 1980s various CCD-line stereo scanners have been flown on
airborne platforms. All known systems are summarised with their main parameters
in Table 3.

The MEOSS, WAOSS and HRSC systems were mainly designed for operation
on spacecraft and only used for system/data testing and evaluation. WAAC
and HRSC-A are derived from spacecraft systems (WAOSS, HRSC) for airborne
applications. TLS is the only non-German system. Most of the systems (except
TLS and DPA) were designed and built by or together with DLR. The typical
stereo angle is between 20° and 25°. To overcome occlusion effects, e.g. in cities,
smaller stereo angles are implemented in HRSC and ADC/ADS. In the DPA
and ADS40 the panchromatic stereo lines consist of two lines each. DPA is a
solution with two optics and a three-line focal plate with 6k CCD lines in each
camera, whereas the ADS stereo lines are staggered arrays. In the following,
WAAC, HRSC and ADS40 will be explained in more detail.

Table 3. CCD-line stereo scanners and their main parameters

System   Focal length   Pixels       FOV   Stereo angle   Bits     GSD / Swath / Height
MEOSS    61.6 mm        3236         30°   23.5°          8 bit    2 m / 6.4 km / 11 km
WAOSS    21.7 mm        5184         80°   25°            11 bit   1 m / 5 km / 3 km
DPA      80 mm          2 × 6,000    74°   25°            8 bit    25 cm / 3 km / 2 km
WAAC     21.7 mm        5185         80°   25°            11 bit   1 m / 5 km / 3 km
TLS      38.4 mm        7500         ?     21.5°          10 bit   10 cm / 750 m / 500 m
HRSC-A   175 mm         5184         12°   18.9°, 12.8°   8 bit    12 cm / 620 m / 3 km
ADC-EM   80 mm          12,000       52°   17°, 25°       12 bit   25 cm / 3 km / 3 km
ADS40    62.5 mm        2 × 12,000   64°   17°, 25°       12 bit   12.5 cm / 3 km / 3 km
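As a plausibility check on Table 3, the swath width of a nadir-looking line scanner over flat terrain follows from the flight height and the across-track field of view as swath = 2 h tan(FOV/2). The sketch below applies this relation to three rows of the table. The flat-terrain, symmetric-FOV geometry is an assumption made for the example, and the function name is hypothetical.

```python
import math

def swath_width_m(height_m: float, fov_deg: float) -> float:
    """Swath of a nadir-looking line scanner over flat terrain (assumed geometry)."""
    return 2.0 * height_m * math.tan(math.radians(fov_deg) / 2.0)

# (system, FOV in degrees, flight height in m, swath from Table 3 in m)
rows = [
    ("WAOSS", 80.0, 3000.0, 5000.0),
    ("DPA", 74.0, 2000.0, 3000.0),
    ("HRSC-A", 12.0, 3000.0, 620.0),
]

# Relative deviation between the computed swath and the tabulated value.
deviation = {
    name: abs(swath_width_m(h, fov) - swath) / swath
    for name, fov, h, swath in rows
}

for name, fov, h, swath in rows:
    print(f"{name}: computed {swath_width_m(h, fov):.0f} m, table {swath:.0f} m")
```

For these rows the computed and tabulated swaths agree to within a few percent, which is consistent with the tabulated values being rounded.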




WAOSS / WAAC 0 

WAOSS (Wide Angle Optoelectronic Stereo Scanner) was a part of the imaging
payload for the Mars 96 mission [Alberts, 1996], which failed in November
1996. This camera was mainly designed for medium and large scale observation.
Therefore the camera's field of view is 80°. WAAC (Wide Angle Airborne
Camera) is a modified airborne instrument. The camera has some outstanding
features:

— All necessary corrections of PRNU (pixel response non-uniformity) and
shading effects are applied before the on-line compression.

— WAOSS/WAAC is the first imaging system, which provides more than 8 bit. 

— It is the smallest and most lightweight system in comparison to all other 
scanners (about 4.5 kg for WAAC). 

Swath width and GSD are appropriate for multisensor applications. As an 
example, a joint flight together with the hyperspectral imaging spectrometer 
DAIS is shown in figure 5. 

The background image shows orthorectified WAAC data merged with the thermal
infrared channel of the DAIS. This example is a flight over the volcano Etna,
Italy.

Fig. 6. HRSC (left), HRSC image of the Reichstag, Berlin (right)



HRSC gj 

HRSC (High Resolution Stereo Camera) has five stereo lines. In addition to
the nadir line, two pairs of forward/backward lines with different stereo angles
and four additional multispectral lines are assembled on the focal plate. Like
WAOSS, HRSC was a part of the imaging payload of the Mars 96 mission and
was first flown on an airborne platform in 1997.

HRSC is the first system which worked operationally. Together with the
French company ISTAR, city models were derived from digital airborne data.

The LH Systems ADS40

The ADS40 is the first commercially available digital airborne stereo scanner 
and is a joint development between LH Systems and the German Aerospace 
Centre, DLR. This system is a real alternative to the familiar aerial film-based 
camera for a spatial resolution range (Ground Sample Distance) between 10 cm 
and 1 m. 

The camera has three panchromatic CCD lines of 2 × 12,000 pixels each, staggered
by 3.25 µm, and four multispectral CCD lines of 12,000 pixels each.
Panchromatic image strips can therefore have more than 20,000 pixels in line 
direction and are comparable to the performance of aerial film-based camera for 
this application range. Imaging principles and resolution investigations for this 
staggered line approach can be found in jn|. 

The colour design of the camera was focused on multispectral applications. 
True colour must be derived from the measurements of multispectral lines. Con- 
trary to all other CCD-line scanners the RGB colour lines are optically superim- 
posed during the flight using a special dichroic beamsplitter. True colour images 
can be derived directly from the measured data. The near infrared channels are 
slightly offset with respect to the panchromatic nadir CCD lines. Table 4 lists
the parameters of the ADS40 in more detail.

Table 4. Parameters of the ADS40

Parameter                   Photogrammetric Lines   Spectral Lines
Focal Length, f             62.5 mm                 62.5 mm
FOV (Across the Track)      64°                     64°
Number of CCD-Lines         3                       4
Elements per CCD-Line       2 × 12,000              12,000
Stereo Angle
  Forward-Nadir             26°
  Backward-Nadir            16°
  Forward-Backward          42°
Dynamic Range               12 bit                  12 bit
Radiometric Resolution      8 bit                   8 bit
Flight Height               3000 m
Ground Sample Distance      16 cm                   32 cm
Swath Width                 3000 m
Data Compression Factor     2...5                   2...5
Data Compression Method     JPEG, lossless          JPEG, lossless
Output Data Rate            12...60 MWord/s         6...30 MWord/s
Mass Memory for one hour    < 200 Gbyte             < 100 Gbyte



The main feature of the ADS40 is the staggered arrangement of the panchromatic
stereo lines, which satisfies both requirements: small ground sample distance
and large swath width. The effect of staggering is shown in figure 7. The left
image was taken with a 12k linear sensor, which corresponds to a pixel size of
6.5 µm and a ground sampling distance of about 20 cm (resampled to 10 cm).
Because of the optics-limited frequency of about 150 lp/mm, aliasing effects are
visible. The right image shows the result of the processed 12k staggered sensor,
resampled to 10 cm. The data rate is fourfold. In this image no aliasing effects
are visible and the image has a visually better resolution. The Siemens star image
does not change significantly, because this test chart measures the MTF of the
imaging system.
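The numbers quoted for the staggered arrangement can be checked with two elementary relations: the ground sample distance GSD = p h / f for sampling pitch p, flight height h and focal length f, and the Nyquist frequency 1 / (2p) of the sampling grid. The sketch below uses the focal length and flight height from Table 4 and takes the pixel pitch as 6.5 µm (the pixel size listed for the staggered 12,000-element line in Table 2). It treats staggering simply as halving the sampling pitch, which is a simplification, and the function names are hypothetical.

```python
def ground_sample_distance_m(pitch_m: float, height_m: float, focal_m: float) -> float:
    """Footprint of one sample on the ground for a nadir view over flat terrain."""
    return pitch_m * height_m / focal_m

def nyquist_lp_per_mm(pitch_m: float) -> float:
    """Highest spatial frequency sampled without aliasing, in line pairs per mm."""
    return 1.0 / (2.0 * pitch_m * 1e3)

PIXEL_PITCH_M = 6.5e-6        # single 12k line
STAGGERED_PITCH_M = 3.25e-6   # two lines shifted by half a pixel (simplified)
FOCAL_LENGTH_M = 62.5e-3
FLIGHT_HEIGHT_M = 3000.0
OPTICS_LIMIT_LP_MM = 150.0    # optics-limited frequency quoted in the text

gsd_single = ground_sample_distance_m(PIXEL_PITCH_M, FLIGHT_HEIGHT_M, FOCAL_LENGTH_M)
gsd_staggered = ground_sample_distance_m(STAGGERED_PITCH_M, FLIGHT_HEIGHT_M, FOCAL_LENGTH_M)

# Aliasing is expected when the optics pass frequencies above the Nyquist limit.
aliasing_single = OPTICS_LIMIT_LP_MM > nyquist_lp_per_mm(PIXEL_PITCH_M)
aliasing_staggered = OPTICS_LIMIT_LP_MM > nyquist_lp_per_mm(STAGGERED_PITCH_M)
```

Under these assumptions the single line gives about 31 cm and the staggered pair about 16 cm at 3000 m, close to the 32 cm and 16 cm listed in Table 4. The Nyquist limit rises from about 77 lp/mm to about 154 lp/mm, which is just above the optics limit and matches the observation that aliasing disappears in the staggered image.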

Figure 8 shows the effect of radiometric zoom. The left image shows a balanced
radiometry as a reference. In contrast to the left image, the right image is
overdriven. In the shadowed part of the quadrangle, structures can be
distinguished which are not visible in the left image. This is a new capability of
CCD cameras, which have a much better radiometry than film-based systems.

Line scanner cameras are sensitive to attitude disturbances of the platform.
The wavy appearance of the building corner is a result of the aircraft movement.
Exact measurement of the aircraft's attitude allows this effect to be corrected.
The effect and its correction are shown in figure 9.

Fig. 7. ADS40 (left), test chart unstaggered and staggered (right)




Fig. 8. Effect of radiometric zoom (Reichstag, Berlin) 



5 High Resolution Imaging Sensors on Satellite Platforms 

These sensors are line sensors. Because of the high speed of the spacecraft (ca.
7 km/s), the cycle time of the camera's panchromatic channel must be between
0.1 ms and 0.2 ms. To overcome radiometric problems (especially for multispectral
channels), TDI lines are used. Time delay and integration (TDI) sensors for
linear imaging are area sensors. The vertical CCD registers are clocked to ensure
that the charge packets are transferred at the same rate and in the same direction
as the image. This ensures that the signal charge building up in the CCD remains
aligned under the same part of the image. In this way, the image signal can
be integrated for much longer, which enhances the signal-to-noise ratio. Table 5
shows typical technical parameters of high resolution satellite sensors.

Fig. 9. Attitude correction of line scanner airborne imagery

Table 5. Typical technical parameters of high resolution satellite sensors

Ground Sample Distance   1 m
Focal Length             up to 10 m
Stereo Mode              in-track and/or across-track
Swath Width              6 to 36 km
Altitude                 460 to 680 km
Orbit time               ca. 100 minutes
Orbit type               sun-synchronous
Repetition Rate          1 to 4 days
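The cycle time and orbit time quoted above can be reproduced from the altitude range in Table 5. The sketch below assumes a circular orbit around a spherical, non-rotating Earth and uses the standard values of Earth's gravitational parameter and mean radius. The relation that TDI improves the signal-to-noise ratio by the square root of the number of stages holds for a shot-noise-limited signal and is included as an assumption; the number of stages is hypothetical.

```python
import math

MU_EARTH = 3.986004418e14  # Earth's gravitational parameter, m^3/s^2
R_EARTH = 6.371e6          # mean Earth radius, m

def orbital_period_s(altitude_m: float) -> float:
    """Period of a circular orbit at the given altitude."""
    a = R_EARTH + altitude_m
    return 2.0 * math.pi * math.sqrt(a ** 3 / MU_EARTH)

def ground_speed_m_s(altitude_m: float) -> float:
    """Speed of the sub-satellite point on the ground (Earth rotation ignored)."""
    a = R_EARTH + altitude_m
    return math.sqrt(MU_EARTH / a) * R_EARTH / a

def line_period_s(gsd_m: float, altitude_m: float) -> float:
    """Time available per image line: one ground sample per line."""
    return gsd_m / ground_speed_m_s(altitude_m)

def tdi_snr_gain(stages: int) -> float:
    """SNR gain of TDI over a single line, assuming shot-noise-limited signal."""
    return math.sqrt(stages)

for altitude_km in (460.0, 680.0):
    h = altitude_km * 1e3
    print(f"{altitude_km:.0f} km: period {orbital_period_s(h) / 60:.1f} min, "
          f"ground speed {ground_speed_m_s(h) / 1e3:.2f} km/s, "
          f"line period for 1 m GSD {line_period_s(1.0, h) * 1e3:.3f} ms")
```

For altitudes between 460 km and 680 km this gives orbit times of roughly 94 to 98 minutes, ground speeds close to 7 km/s and line periods of about 0.14 to 0.15 ms for a 1 m GSD. Such short integration times are why TDI is needed to collect enough signal.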



The first commercial system for earth resource mapping was SPOT. The
SPOT satellite Earth Observation System was designed by CNES (Centre
National d'Etudes Spatiales), France, and was developed with the participation
of Sweden and Belgium. The SPOT system has been operational for more than
ten years. SPOT 1 was launched on 22 February 1986 and was withdrawn from
active service on 31 December 1990. SPOT 2 was launched on 22 January 1990
and is still operational. SPOT 3 was launched on 26 September 1993; an accident
on 14 November 1996 disabled it. The most recent satellite is SPOT 4, which was
launched in 1998. SPOT 5 is to be launched in late 2002. The ground sample
distance of SPOT 1-4 is 10 m. With SPOT 5 a GSD of about 2.5 m will be reached.
Apart from simple nadir mapping, stereo views are also possible. Oblique viewing
of the SPOT system makes it possible to produce stereo pairs by combining
two images of the same area acquired on different dates and at different angles
(across-track stereo). Figure 10 shows this approach.

The main disadvantage of this approach is that the investigated region is 
viewed under different illumination and weather conditions. 

Another possibility is the three-line principle, which was described for
airborne sensors (in-track stereo). Because of the long focal length a single focal
plane is impossible. Therefore three cameras are necessary, which makes the
system large and expensive. An example of this approach is MOMS-02 (Modularer
Optoelektronischer Multispektraler Stereoscanner), which was flown on the D2
shuttle mission in 1993 and on the MIR space station (since 1996).

Fig. 10. Stereoscopy with oblique viewing

The third possibility is a sensor tilt in the in-track and across-track directions.
The system first scans the patch in the forward view; after obtaining this
measurement the mirror tilts to another view to measure the same patch under a
different view angle.

Since 1999 a new era of spaceborne imaging has begun, with commercial systems
offering a GSD of 1 m or better coming into use. The first
successful system was IKONOS 0. Table 6 summarises these satellites. An in- 
teresting point is that the Indian IRS IC/D system, which has been operational 
since 1995, was the imaging system with the highest resolution up to the launch 
of IKONOS. 



Table 6. High resolution satellite sensors

System      Launch   Company         Stereo       GSD / Swath / Height     Comment
IRS-1C/D    1995     India (Gov)     Across       5 m / 70 km / 817 km     In operation
EarlyBird   1997     Earthwatch      In-/across   3 m / 3 km / 600 km      Failed
Ikonos      1998     Space Imaging   In-/across   1 m / 11 km / 681 km     Failed
Ikonos      1999     Space Imaging   In-/across   1 m / 11 km / 681 km     In operation
QuickBird   2000     Earthwatch      In-/across   1 m / 22 km / 600 km     Not launched
OrbView     2000     Orbimage        In-/across   1 m / 8 km / 470 km      Not launched



The first real 1 m satellite is IKONOS. Therefore this system is explained
in more detail. The digital camera system was designed and built by
Eastman Kodak Company, Rochester, NY. Each camera can see objects less
than one meter square on the ground. This capability from an orbital altitude
of 680 km represents a significant increase in image resolution over any other
commercial remote sensing satellite system. Figure 11 displays a part of the first
image of IKONOS, showing the Washington Monument.

Fig. 11. First image of IKONOS, the Washington Monument


The camera system of the IKONOS satellite is able to collect simultaneously 
panchromatic (grey-scale) imagery with one-meter resolution and multispectral 
data (red, green, blue, and near infrared) with four-meter resolution, across an 11 
km swath of the Earth’s surface. The panchromatic imagery will provide highly 
accurate Earth imagery, enabling geographic information system (GIS) users to 
generate precision maps. The multispectral data will have a variety of scientific 
applications, including environmental and agricultural monitoring. 

Airborne and spaceborne sensors complement each other. Spaceborne imagery
cannot replace airborne imagery because of limitations in ground resolution
and flexibility in the choice of "orbit". On the other hand, spaceborne sensors
make it possible to map regions which are not accessible by aeroplane.

6 Multispectral Channels of Airborne and Spaceborne 
Scanners 

Multispectral channels are incorporated in addition to the panchromatic stereo
channels. Multispectral imagery with high spatial resolution opens new remote
sensing capabilities. Data fusion between the channels and other sensors,
together with digital elevation models derived from panchromatic stereo data,
creates new scientific opportunities. Besides multispectral applications, true
colour images are becoming more important for photogrammetric applications and
have to be derived from colour-processed multispectral images.

Fig. 12. Spectral channels of selected airborne and spaceborne instruments



When choosing multispectral bands, narrow-band filters placed on interesting
spectral features are necessary. Therefore true colour images have to be
derived from the multispectral channels. Figure 12 shows the spectral channels
of selected airborne and spaceborne instruments. For comparison, a vegetation
reflection curve is also shown. TM (Thematic Mapper) is the prototype for all
multispectral systems. The systems in the upper block (from TM to IKONOS) are
spaceborne. Except for IKONOS, all of them are purely multispectral systems.
DPA, HRSC and ADS are airborne systems.



7 Photogrammetric and Cartographic Data Processing 



The photogrammetric processing includes: 

— digital image matching

— digital surface model (DSM) and orthoimage generation 

— mosaicing and merging of multispectral data 

For geo-referencing, a combined INS and differential GPS based data processing
routine is necessary. The main problem in processing stereo line scanner images
is that all image processing tasks (e.g. matching) are possible only in
attitude-corrected images. On the other hand, the internal calibration and
external orientation information is connected with the original disturbed
dataset. Therefore each pixel needs a relation between the corrected and the
disturbed image. Another problem is matching in the non-epipolar geometry of
the attitude-corrected images, where disparities occur in both image
directions.

The first step is a combination of hierarchical feature-based and area-based
matching. The digital surface model (DSM) is derived from multiple ray
intersections. Based on the DSM information, various cartographic products can
be generated. The results are

— 3 dimensional images and maps 

— True colour and multispectral images 

— Relief maps of the target area 

— Video animation of virtual flights over the target area 

A completely automatic photogrammetric and cartographic processing is possi- 
ble with existing stereo workstations. Adaptations for different scanner models 
are possible. 
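
The multiple ray intersection used to derive the DSM can be illustrated with a
small Python sketch. It is not the processing chain described in this article:
the function names (intersect_rays, _solve3), the unweighted least-squares
criterion and the idea of feeding it one viewing ray per stereo channel are
assumptions made for illustration. Each matched pixel is assumed to have
already been converted into a ray (sensor position and viewing direction) in a
common object coordinate system.

```python
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def _solve3(a: List[List[float]], b: List[float]) -> Vec:
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("rays are (nearly) parallel; no unique point")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return (m[0][3] / m[0][0], m[1][3] / m[1][1], m[2][3] / m[2][2])


def intersect_rays(origins: Sequence[Vec], directions: Sequence[Vec]) -> Vec:
    """Return the point with minimal summed squared distance to all rays."""
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        norm = sum(c * c for c in d) ** 0.5
        u = [c / norm for c in d]
        for i in range(3):
            for j in range(3):
                # Projector onto the plane orthogonal to the ray: I - u u^T
                p = (1.0 if i == j else 0.0) - u[i] * u[j]
                a[i][j] += p
                b[i] += p * o[j]
    # Normal equations: (sum of projectors) x = sum of projector * origin
    return _solve3(a, b)
```

With noise-free rays from three sensor positions that all look at the same
ground point, intersect_rays returns that point; with noisy matches it returns
the point closest to all rays in the least-squares sense.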

8 Conclusions 

The article gives an overview of the existing high resolution imaging systems
on spaceborne and airborne platforms. Besides test and research systems like
MOMS, commercial systems are also in operation. Since August 1999 image data of
the high resolution spaceborne system IKONOS has been available. At the
ISPRS 2000 conference in July 2000 in Amsterdam, the airborne scanner ADS40 was
introduced. The change into the digital domain, from image generation through
image processing to data evaluation (GIS), is completed. Additional colour
channels allow true colour images and multispectral data evaluation.


Abstract. This paper discusses the techniques of image acquisition for
3D scene visualization and reconstruction applications (3DSVR). The
existing image acquisition approaches in 3DSVR applications are briefly
reviewed. There is still a lack of studies on which principles are
essential in the design and how the limitations of an image acquisition
model can be characterized in a formal way. This paper addresses some of
the main characteristics of existing image acquisition approaches, sum- 
marized through a classification scheme and illustrated with many ex- 
amples. The results of the classification lead to general characterizations 
in establishing the notions (basic components) for design, analysis and 
assessment of image acquisition models. The notions introduced include: 
focal set, receptor set, reflector set etc. The formal definitions of the 
notions and the exploration of relationships among the components are 
given. Various examples are provided for demonstrating the flexibility 
and compactness in characterizing different types of image acquisition 
models such as concentric, polycentric, cataoptrical panoramas etc. The 
observations, important issues, and future directions from this study are 
also elaborated. 



1 Introduction 

Image acquisition is a process for obtaining data from real 3D scenes. The
image acquisition process has a critical impact on subsequent processes in 3D
scene visualization and reconstruction (3DSVR) applications. The applications
using panoramic images include, for instance, stereoscopic visualization IBEl 
1^ . stereo reconstruction |12ll4l21i33I37j . image-based rendering |4I1 31 19127^ . 
localization, route planning or obstacle detection in robot-navigation [I2EH1- 

An image acquisition model defines image-acquiring components and their
usage in the image acquisition process for a particular application. The
specifications of image acquisition models typically differ between
applications, except for some basic characterizations which will be discussed
later. A scenario for developing an image acquisition model may go through the
following steps: (1) list the requirements and specifications of the
application under investigation; (2) sketch the problem(s) and possible
solution(s) or approaches; (3) design an image acquisition model, involving
pose planning, sensor design, illumination conditioning etc., which may lead to
solutions and satisfy practical constraints; (4) implement and test the image
acquisition model.

R. Klette et al. (Eds.): Multi-Image Analysis, LNCS 2032, pp. 81-^^ 2001.

Conceptually, the closer the relation between the data acquired and the out- 
come expected in a 3DSVR application, the simpler the processes involved as 
well as the better the performance. QuicktimeVR 0 serves as a good example. 
In reality, however, physical constraints (such as temporal and spatial
factors) and practical issues (e.g. cost, availability) complicate the design
and the realization of an image acquisition model for a 3DSVR application.
Therefore it is important to study these constraints and issues as well as how
they influence the design of image acquisition models.

Since different image acquisition models result in different subsequent
processes with different characteristics with respect to both geometrical and
photometrical analysis, it is very risky to develop 3DSVR applications without
serious consideration of the suitability of the acquired image data for the
intended application. Failure to assess the data may not only add unnecessary
complexity to the subsequent processes but may even make it impossible to
fulfill the requirements of the application.

Some researchers see a need for designing new image acquisition systems
especially for 3DSVR applications. Their point of view is that the traditional
image acquisition models may not be able to serve all kinds of tasks in
3DSVR applications. Researchers in the image-based rendering community have
also noticed this inadequacy. They reconfigured some components of
the traditional image acquisition models (e.g. a pinhole projection model with a
pre-defined camera motion) and obtained some interesting results (i.e. novel view
generations without 3D reconstructions) |5ll§l^6l51l3g| . However there are still 
too few studies of which principles are essential in the design and of how the
limitations of an image acquisition model can be characterized in a formal way.

To be able to design, analyze and assess an image acquisition model, we
need to establish the building blocks (basic components) that make up the
architecture of image acquisition models. In the next section, a classification
of recent image acquisition approaches for 3DSVR applications is presented. The
results of the classification characterize the existing image acquisition
approaches and lead to the general characterizations in Section 3, where we
introduce some basic notions for the design, analysis and assessment of image
acquisition models. The observations, important issues, and future directions
from this study are addressed in the conclusion.

Fig. 1. The examples of the first two classification rules: visual field and focal points
associated with each image. See text for details.



2 Classification 

For simplicity, this paper mainly focuses on passive imaging and leaves the fac- 
tors introduced by active imaging to be incorporated later. To avoid the problems 
where the resulting classification becomes untraceable or lost in excessive detail, 
four binary classification rules are used. They are defined as follows: 

Visual field: Circular/Non-circular
Focal point(s) associated with each image: Single/Multiple
Acquiring time: Different/Same
Acquiring pose: Different/Same

In Fig. 1 the examples of the intermediate classes from the first two
classification rules are depicted. A planar image and a cylindrical image
(Fig. 1 a and b) are examples for the non-circular and the circular visual field
classes. A pinhole projection model (Fig. 1 c) is the example for the class of a
planar image associated with a single focal point. For the class of a planar
image associated with multiple focal points the three examples (Fig. 1 d) are:
each image column associates with a focal point (left); each image row
associates with a focal point (middle); and each image pixel associates with a
focal point (i.e. orthographic projection) (right). A central-projection
panorama (Fig. 1 e) is the example for the class of a cylindrical image
associated with a single focal point. For the class of a cylindrical image
associated with multiple focal points the three examples
(Fig. 1 f) are: each image column associates with a focal point (left); each image

^ The terms, active imaging and passive imaging, are frequently used to distinguish 
whether or not equipment (such as lighting device or laser) is acting on the physical 
scene while carrying out image acquisition m- 
row associates with a focal point (middle); and each image pixel associates with
a focal point (right).

Note that the temporal classification rule may be used in a flexible way, that 
is, the acquiring time can be conceptually rather than physically the same. For 
example, a binocular stereo pair, used in stereo matching, acquired in a static 
scene under an almost constant illumination condition can be regarded concep- 
tually as having the same acquiring time. For the acquiring pose classification 
rule, two other characteristics, translation and rotation, may be used to further 
specify the possible classes. 
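
The four binary rules span 2^4 = 16 classes. The following Python sketch
enumerates them. The numbering scheme is an assumption: it is chosen so that it
agrees with the class numbers quoted for the examples in this section (e.g.
class 3 for a binocular stereo pair, class 12 for a single-focal-point
panorama, class 15 for multiple multi-focal-point panoramas). The names RULES,
class_number and all_classes are hypothetical.

```python
from itertools import product

# One entry per classification rule; the first option of each rule gets bit 0.
RULES = (
    ("visual field", ("non-circular", "circular")),
    ("focal point(s) per image", ("single", "multiple")),
    ("acquiring time", ("different", "same")),
    ("acquiring pose", ("different", "same")),
)


def class_number(visual_field, focal_points, time, pose):
    """Map one choice per rule to a class number in 1..16 (assumed ordering)."""
    choices = (visual_field, focal_points, time, pose)
    bits = [options.index(c) for (_, options), c in zip(RULES, choices)]
    # The first rule acts as the most significant bit.
    return 1 + 8 * bits[0] + 4 * bits[1] + 2 * bits[2] + bits[3]


def all_classes():
    """Return a dict {class number: (visual field, focal, time, pose)}."""
    combos = product(*(options for _, options in RULES))
    return {class_number(*c): c for c in combos}
```

For instance, class_number("non-circular", "multiple", "same", "different")
gives 7, the class of the three-line scanner example.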



Now let us look at some examples of the existing image acquisition approaches
in the different classes. The conditions of each class are specified in Tab. 1.
A typical example for class 1 is a video camera acquiring a video sequence. A
video surveillance system is an example for class 2 because it fixes the image
acquisition system at a particular place. In class 3, the binocular stereo pair acquisition
is a typical example. More recently approaches such as in light field m and 
Lumigraph also fall into class 3, where both approaches arrange the poses of 
a pinhole camera to a planar grid layout and assume a constant acquiring time 
for their applications. Re-sampling the acquired image data, which are parame- 
terized into a 4D function, generates a novel view for use in their visualization 
applications. Other examples in class 3 are j2ll dl 1 7121)128181)] . In class 4, sin- 
gle still image capturing, currently the most common image acquisition model, 
serves as an example. 



In class 5, the image acquisition model used in three light-sources photo- 
metric stereo method m for 3D reconstruction is one of the examples, where 
an orthographic projection (non-circular visual field and multiple focal
centers) is used, with only one of the light sources switched on for each image
capture (different acquiring times) and multiple views (different acquiring
poses) for a full 3D reconstruction. Another example in class 5 is reported in
[25]. For 2.5D reconstruction of a 3D scene from a single viewing pose, a
well-known passive approach, depth from focus/defocus, is usually adopted. An
example of such an
approach which falls into class 6 is the multi-focus camera with a coded aperture 
proposed in 0. 

For class 7, an example is a three-line scanner system (pushbroom camera 
0 on an airplane) used in the terrain reconstruction or heat-spot sensing \ll\ 
ES]- This setup is characterized by (1) non-circular visual field; (2) each image 
column associates with a focal point, (3) acquiring time is conceptually the same 
as performing the stereo matching; and (4) the poses of each line-scanner are 
inherently different. 

Now we look at some examples of the approaches with circular visual field. 
In class 9, there are quite a few systems already developed in research institutes 
or commercialized in industry isni. One of the applications for this so-called 
dynamic-panorama-video system is to allow the user to visualize a real scene
by virtually walking along a path (where data is acquired) and looking around
360° from any point on the path. For instance, to visualize the interior of a
building (i.e. walking through corridors, lobby, rooms, etc.), the path is
planned and a robot (on which a dynamic-panorama-video system is installed)
carries out the image acquisition. An interactive explorer needs to be
developed, providing an interface for the user to explore the interior of the
building. Some image-based rendering techniques can be used to interpolate the
missing data (i.e. gaps/holes in the synthesized image) or to extrapolate the
acquired data to some extent, such that the viewing space can be expanded to a
certain degree.

Table 1. The 16 proposed classes of image acquisitions for 3D scene visualization and
reconstruction applications with their conditions and the selected examples. Note that
indicates that the authors have not been aware of existing examples; whereas “[*]”
means that the class is very common and there are too many examples to be cited
here.

The applications (e.g. environmental study; surveillance, etc) requiring a 
panoramic-video system to be fixed at one position for a long period of time are 
members of class 10 (i.e. static-panorama-video system). For example, a system 
was deployed on the bank of a lake (West-lake, China) for monitoring environ- 
mental change and acquiring 360° panorama continuously for the whole year in 
1998. Some examples of the surveillance application can be found in mu. 

^ The density of view points on a path depends on the frame rate of the video camera 
and the speed of the movement during the image acquisition. 

^ An explorer is software that synthesizes a view (image) by resampling the
acquired data according to the user’s current viewing condition.

A well-known example for a single-focal-point panorama is QuickTimeVR
1^ from Apple Inc., which falls into class 12. Using multiple single- focal-point 
panoramas to reconstruct a 3D scene, class 11, S.B. Kang and R. Szeliski re- 
ported their results in m Other similar examples are lslU)ti2l . The families of 
cataoptrical panorama used for 3D scene reconstruction [31712313^ mostly be- 
long to class 11, except for the configuration: pinhole projection model with a 
spherical mirror in which case it goes to class 15. 



H. Ishiguro et al. first proposed an image acquisition model that is able to 
produce multiple panoramas by a single swiveling of a pinhole-projection camera, 
where each panorama is associated with multiple focal points. It is of class 15. 
The model was created for the 3D reconstruction of an indoor environment. 
Their approach reported in 1992 in H21 already details essential features of the 
image acquisition model. The modifications and extensions of their model have 
been discussed by other works such as |l()tj4tlll32l33! . Other examples of this 
class are [I2,5l36j . 



3 Characterization 



In this section we introduce some notions at an abstract level for characterizing
the essential components of image acquisition models used in 3DSVR
applications. The notions include the focal set $\mathcal{F}$, receptor set
$\mathcal{S}$, projection-ray set $\mathcal{U}$, reflector set $\mathcal{R}$,
reflected-ray set $\mathcal{V}$, plus temporal and spatial factors. The formal
definitions are given, followed by the exploration of relationships among the
components. Various examples are provided for demonstrating the flexibility and
compactness in characterizing different types of image acquisition models.

Definition 1. A focal set $\mathcal{F}$ is a non-empty (finite) set of focal
points in 3D space. A focal point, an element of $\mathcal{F}$, can be
represented as a 3-vector in $\mathbb{R}^3$.

Definition 2. A receptor set $\mathcal{S}$ is a non-empty infinite or finite set
of receptors (photon-sensing elements) in 3D space. A receptor, an element of
$\mathcal{S}$, can be characterized geometrically as a 3-vector in $\mathbb{R}^3$.

In practice, a focal set $\mathcal{F}$ contains a finite number of focal points,
but a receptor set $\mathcal{S}$ may have either an infinite or a finite number
of receptors, depending on the type of photon-sensing device used. For instance,
the radiational film (negative) is regarded as containing infinitely many
photon-sensing elements, whereas the CCD chip in a digital camera contains only
a finite number of photon-sensing elements.

It is convenient to express a collection of points by a supporting geometric
primitive, such as a straight line, curve, plane, quadratic surface etc., on
which all of the points lie. For example, the pinhole projection model consists
of a single focal point (i.e. the cardinality of the focal set is equal to 1,
or formally $\#(\mathcal{F}) = 1$) and a set of coplanar receptors. The
orthographic projection model consists of a set of coplanar focal points and a
set of coplanar receptors. The single-center panoramic image model (e.g.
QuickTimeVR) consists of a single focal point and a set of receptors lying on a
cylindrical or spherical surface. The multi-perspective panoramic image model
consists of a set of focal points on various geometrical forms (such as a
vertical straight line, a 2D circular path, a disk, or a cylinder etc.) and a
set of receptors lying on a cylinder.

Space is filled with a dense volume of light rays of various intensities. A sin- 
gle light ray with respect to a point in 3D space at one moment of time can be 
described by seven parameters, that is, three parameters describing the point’s 
location, two parameters describing the ray’s emitting angle, one parameter de- 
scribing the wavelength of the light in the visible spectrum, and one parameter 
describing the time. A function taking these seven parameters as inputs and out- 
putting a measure of the intensity is called plenoptical function P^. All possible 
light rays in a specified 3D space and time interval form a light field, denoted as
$\mathcal{L}$.

The association between focal points in $\mathcal{F}$ and receptors in
$\mathcal{S}$ determines a particular proper subset of the light field. For
instance, a complete bipartite set of focal and receptor sets is defined as

$$\mathcal{B}_{\mathcal{F}\times\mathcal{S}} = \{(p, q) : p \in \mathcal{F} \text{ and } q \in \mathcal{S}\},$$

where each element $(p, q)$ specifies a light ray passing through the point $p$
and striking the point $q$. Note that a complete bipartite set of focal and
receptor sets is a proper subset of the light field (i.e.
$\mathcal{B}_{\mathcal{F}\times\mathcal{S}} \subset \mathcal{L}$).

Definition 3. A focal-to-receptor association rule defines an association be- 
tween a focal point and a receptor, where a receptor is said to be associated 
with a focal point if and only if any light ray which is incident with the receptor 
passes through the focal point. 

Each image acquisition model has its own association rule for the focal and
receptor sets. Sometimes a single rule is not enough to specify complicated
association conditions between the two sets, so a list of association rules is
required. A pair of elements satisfies a list of association rules if and only
if the pair satisfies each individual association rule.

Definition 4. A projection-ray set $\mathcal{U}$ is a non-empty subset of the
complete bipartite set of focal and receptor sets (i.e.
$\mathcal{U} \subseteq \mathcal{B}_{\mathcal{F}\times\mathcal{S}} \subset \mathcal{L}$),
which satisfies the following conditions:

1. It holds $(p, q) \in \mathcal{U}$ if and only if $(p, q)$ satisfies a (list
of) pre-defined association rule(s);

2. For every $p \in \mathcal{F}$, there is at least one $q \in \mathcal{S}$ such
that $(p, q) \in \mathcal{U}$;

3. For every $q \in \mathcal{S}$, there is at least one $p \in \mathcal{F}$ such
that $(p, q) \in \mathcal{U}$.



For example, the projection-ray set $\mathcal{U}$ of the traditional pinhole
projection model is the complete bipartite set of focal and receptor sets,
because there is only a single focal point and every receptor defines a unique
projection ray through the focal point. Moreover, the projection-ray set in
this case is a proper subset of the pencil of rays at that focal point.

The projection-ray set $\mathcal{U}$ of a multi-perspective panoramic image
acquisition model is a subset of the complete bipartite set of focal and
receptor sets and can be characterized formally as follows. The focal points in
$\mathcal{F}$ are an ordered finite sequence, $p_1, p_2, \ldots, p_n$, which all
lie on a 1D circular path in 3D space. The set of receptors forms a uniform
(orthogonal) 2D grid and lies on a 2D cylindrical surface that is co-axial to
the circular path of the focal points. The number of columns of the grid is
equal to $n$. The association rules determining whether $(p, q)$ belongs to the
projection-ray set $\mathcal{U}$ are as follows:

1. All $q \in \mathcal{S}$ which belong to the same column must be assigned to a
unique $p_i \in \mathcal{F}$.

2. There is an ordered one-to-one mapping between the focal points
$p_i \in \mathcal{F}$ and the columns of the grid. In other words, the columns
of the grid, either counterclockwise or clockwise, may be indexed as
$c_1, c_2, \ldots, c_n$ such that every $q \in c_i$ is mapped to $p_i$,
$i \in [1..n]$.
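
The two association rules above can be made concrete with a short Python
sketch that builds a focal set, a receptor set and the resulting projection-ray
set, and then checks conditions 2 and 3 of Definition 4. The function names,
the parameters (radii, number of rows, cylinder height) and the choice of
placing column c_i at the same angle as focal point p_i are assumptions made
for illustration; the text only requires an ordered one-to-one mapping.

```python
import math


def multi_perspective_rays(n, rows, focal_radius, cylinder_radius, height):
    """Build (F, S, U) for n focal points on a circle and an n-column grid."""
    focal, receptors, rays = [], [], set()
    for i in range(n):
        angle = 2.0 * math.pi * i / n
        # Focal point p_i on the circular path (in the plane z = 0).
        p = (focal_radius * math.cos(angle), focal_radius * math.sin(angle), 0.0)
        focal.append(p)
        for r in range(rows):
            z = height * (r / (rows - 1) - 0.5) if rows > 1 else 0.0
            # Receptor in column c_i of the co-axial cylindrical grid.
            q = (cylinder_radius * math.cos(angle),
                 cylinder_radius * math.sin(angle), z)
            receptors.append(q)
            rays.add((p, q))  # rule: every q in column c_i is mapped to p_i
    return focal, receptors, rays


def is_projection_ray_set(focal, receptors, rays):
    """Check that U is a non-empty subset of F x S covering all p and all q."""
    f, s = set(focal), set(receptors)
    subset = all(p in f and q in s for p, q in rays)
    every_focal_point_used = {p for p, _ in rays} == f   # condition 2
    every_receptor_used = {q for _, q in rays} == s      # condition 3
    return bool(rays) and subset and every_focal_point_used and every_receptor_used
```

For n > 1 the resulting set is a proper subset of the complete bipartite set,
in contrast to the pinhole model, where the two sets coincide.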

Definition 5. A reflector set $\mathcal{R}$ is a set of reflectors’ surface
equations, usually a set of first or second order continuous and differentiable
surfaces in 3D space.

A reflector set, e.g. mirror(s), is used to characterize how light rays can be
captured indirectly by the receptors. For instance, a hyperbolic mirror is used
in conjunction with the pinhole projection model for acquiring a wide visual
field of a scene (e.g. a 360° panorama). Similarly, a parabolic mirror is
adopted with the orthographic projection model. This type of image acquisition
model lets all the reflected projection rays intersect at the focus of the
hyperboloid, which results in a simple computational model that supports
possible 3DSVR applications.

Let $\wp(\mathcal{R})$ denote the power set of the reflector set. Define a
geometrical transformation $T$ as follows:

$$T : \mathcal{U} \times \wp(\mathcal{R}) \to \mathcal{A}, \qquad ((p, q), s) \mapsto (p', q'),$$

where $\mathcal{A}$ is a non-empty subset of the light field. An element of
$\mathcal{A}$, a light ray, is represented by a pair of points, denoted as
$(p', q')$, specifying its location and orientation. The transformation $T$ is
a function which transforms a projection ray with respect to an element of
$\wp(\mathcal{R})$ into a reflected ray.

Definition 6. A reflected-ray set $\mathcal{V}$ is a non-empty set of light
rays, which is a subset of the light field. Formally,

$$\mathcal{V} = \{T((p, q), s) : (p, q) \in \mathcal{U}\},$$

where $s$ is one particular element of the power set of a reflector set (i.e.
$s \in \wp(\mathcal{R})$).
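
A possible reading of the transformation T is sketched below in Python for the
simplest kind of reflector, a planar mirror. This is an illustration under
stated assumptions, not the authors' formulation: mirrors are given as a plane
point and a unit normal, the chosen element s is passed as an ordered tuple
(the power set itself carries no order), and the ray is traced from the
receptor q through the focal point p towards the mirrors. The name reflect_ray
is hypothetical.

```python
def reflect_ray(ray, mirrors=()):
    """Apply T((p, q), s) for planar mirrors s = ((point, unit_normal), ...)."""
    p, q = ray
    if not mirrors:
        # Empty reflector subset: the projection ray is returned unchanged.
        return ray
    origin = p
    # Trace from the receptor q through the focal point p into the scene.
    direction = tuple(a - b for a, b in zip(p, q))
    for point, normal in mirrors:
        denom = sum(d * n for d, n in zip(direction, normal))
        if abs(denom) < 1e-12:
            raise ValueError("ray is parallel to the mirror plane")
        t = sum((c - o) * n for c, o, n in zip(point, origin, normal)) / denom
        if t <= 0.0:
            raise ValueError("mirror lies behind the ray origin")
        # Intersection with the mirror plane, then specular reflection.
        origin = tuple(o + t * d for o, d in zip(origin, direction))
        direction = tuple(d - 2.0 * denom * n for d, n in zip(direction, normal))
    # Two points (p', q') fixing location and orientation of the reflected ray.
    return origin, tuple(o + d for o, d in zip(origin, direction))
```

Curved reflectors such as hyperbolic or parabolic mirrors would need their own
intersection and normal computations; only the reflection step stays the same.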



The set of all rays passing through one point in space is called a pencil.

Note that, when a transformation of a projection-ray set takes place, only
one element of $\wp(\mathcal{R})$ is used. In particular, if
$\emptyset \in \wp(\mathcal{R})$ is chosen, the resulting reflected-ray set is
identical to the original projection-ray set. When the chosen $s$ has more than
one element, the transformation behaves like ray-tracing.

A single projection-ray set (or a reflected-ray set; we do not repeat this in
the following) refers to the set of light rays defined by an image acquisition
model at a moment of time and a specific location. Two factors are added to
characterize multiple projection-ray sets. The temporal factor describes the
acquisition time, and the spatial factor describes the pose of the model. A
collection of (multiple) projection-ray sets is denoted as
$\{\mathcal{U}_{t,p}\}$, where $t$ and $p$ indicate time and pose, respectively.
Multiple images, i.e. a collection of projection-ray sets acquired at different
times or poses, are a subset of the light field.

Some 3DSVR applications use only a single projection-ray set to approximate
a complete light field in a restricted viewing zone, and some require multiple
images in order to perform special tasks such as depth from stereo. Regardless
of the time factor, acquiring a complete light field of a medium-to-large scale
space is already known to be very difficult, or almost impossible, with the
technology available to date. Usually, a few sampled projection-ray sets are
acquired for approximating a complete light field. Due to the nature of scene
complexity, the selection of a set of optimal projection-ray samples becomes
an important factor in determining the quality of the approximation of a
complete light field of a 3D scene.



4 Conclusions 

This paper discusses image acquisition approaches for 3D scene visualization and
reconstruction applications. The importance of the role of image acquisition and
its impact on the subsequent processes in developing a 3DSVR application are
addressed. It may be risky and inappropriate, both in research and in the
development of 3DSVR applications, not to consider the possibility that other
or better image acquisition models might exist.

We designed and applied a classification scheme for the existing image
acquisition approaches for 3DSVR applications. Some existing image acquisition
approaches in 3DSVR applications are briefly reviewed. The results of the
classification lead to general characterizations in establishing notions (basic
components) for design, analysis and assessment of image acquisition models.

In future we will look further into the relationship between applications and
image acquisition models. Given some conditions with respect to a particular
3DSVR application: (1) what are the capabilities and limitations of an image
acquisition model; and (2) what criteria should be used to evaluate the
developed image acquisition model with respect to its application?

An extension of this study could develop into an approach that automatically
generates (optimal) solution(s) of image acquisition models satisfying the
image acquisition requirements of a 3DSVR application. The success of the
model has direct practical benefits for 3DSVR applications. From a theoretical
point of view, it helps us to understand what image acquisition can support
a 3DSVR application (capability analysis) as well as how far the support may
go (limitation analysis).


Abstract. The structure multivector is a new approach for analyzing
the local properties of a two-dimensional signal (e.g. image). It combines 
the classical concepts of the structure tensor and the analytic signal 
in a new way. This has been made possible using a representation in 
the algebra of quaternions. The resulting method is linear and of low 
complexity. The filter-response includes local phase, local amplitude and 
local orientation of intrinsically one-dimensional neighborhoods in the 
signal. As for the structure tensor, the structure multivector field can be 
used to apply special filters to it for detecting features in images. 



1 Introduction 

In image and image sequence processing, different paradigms of interpreting the
signals exist. Regardless of whether they follow a constructive or an
appearance-based strategy, they all need a capable low-level preprocessing
scheme. The analysis of the underlying structure of a signal is an often
discussed topic. Several
capable approaches can be found in the literature, among these the quadrature 
filters derived from the 2D analytic signal |Zj, the structure tensor pE) . and 
steerable filters jS]. 

Since the preprocessing is only the first link in a long chain of operations, it is 
useful to have a linear approach, because otherwise it would be nearly impossible 
to design the higher-level processing steps in a systematic way. On the other 
hand, we need a rich representation if we want to treat as much as possible 
in the preprocessing. Furthermore, the representation of the signal during the 
different operations should be complete, in order to prevent a loss of information. 
These constraints force us to use the framework of geometric algebra, which is
also advantageous if we combine image processing with neural computing and 
robotics (see P). 

* This work has been supported by German National Merit Foundation and by DFG
Graduiertenkolleg No. 357 (M. Felsberg) and by DFG Grant So-320-2-2
(G. Sommer).

In the one-dimensional case, quadrature filters are a frequently used approach
for processing data. They are derived from the analytic signal by bandpass
filtering. The classical extension to two dimensions is done by introducing a
preference direction of the Hilbert transform [7]; the resulting filter is
therefore not very satisfying, because the orientation has to be sampled.

The alternative approach is to design a steerable quadrature filter pair |^, 
which needs an additional preprocessing step for estimating the orientation. As 
a matter of course, this kind of orientation adaptive filtering is not linear. 

The structure tensor (see e.g. |H|) is a capable approach for detecting the 
existence and orientation of local, intrinsic one-dimensional neighborhoods. From 
the tensor field the orientation vector field can be extracted and by a normalized 
or differential convolution special symmetries can be detected (E| ■ The structure 
tensor can be computed with quadrature filters, but the tensor itself does not
possess the typical properties of a quadrature filter. In particular, the
linearity and the split of the identity are lost, because the phase is neglected.

In this paper, we introduce a new approach for the 2D analytic signal which 
enables us to substitute the structure tensor by an entity which is linear, pre- 
serves the split of the identity and has a geometrically meaningful representation: 
the structure multivector. 

2 A New Approach for the 2D Analytic Signal 

2.1 Fundamentals

Since we work on images, which can be treated as sampled intervals of
$\mathbb{R}^2$, we use the geometric algebra $\mathbb{R}_{0,2}$, which is
isomorphic to the algebra of quaternions $\mathbb{H}$. The whole complex signal
theory naturally embeds in the algebra of quaternions, i.e. complex numbers are
considered as a subspace of quaternions here. The basis of the quaternions
reads $\{1, i, j, k\}$ while the basis of the complex numbers reads $\{1, i\}$.
Normally, the basis vector 1 is omitted.

Throughout this paper, we use the following notations: 

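— vectors are bold face, e.g. $\mathbf{x} = x_1 i + x_2 j$

— the Fourier transform is denoted by $f(\mathbf{x})$ ∘−• $F(\mathbf{u}) = \int f(\mathbf{x}) \exp(i 2\pi \mathbf{u} \cdot \mathbf{x})\, d\mathbf{x}$

— the real part, the i-part, the j-part, and the k-part of a quaternion q are
obtained by $\mathcal{R}\{q\}$, $\mathcal{I}\{q\}$, $\mathcal{J}\{q\}$, and $\mathcal{K}\{q\}$, respectively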

The 1D analytic signal is defined as follows. The signal which is obtained
from $f(\mathbf{x})$ by a phase shift of $\pi/2$ is called the Hilbert transform
$f_H(\mathbf{x})$ of $f(\mathbf{x})$. Since $f_H(\mathbf{x})$ is constrained to be
real-valued, the spectrum must have an odd symmetry. Therefore, the transfer
function has the form $H(\mathbf{u}) = i\,\operatorname{sign}(\mathbf{u})$. If
we combine a signal and its Hilbert transform corresponding to



^ Note that the dot product of two vectors is the negative scalar product
($\mathbf{x} \cdot \mathbf{u} = -\langle \mathbf{x}, \mathbf{u} \rangle$).

^ Since we use vector notation for 1D functions, we have to redefine some real-valued
functions according to $\operatorname{sign}(\mathbf{u}) = \operatorname{sign}(u)$, where $\mathbf{u} = u\, i$.



    $f_A(\mathbf{x}) = f(\mathbf{x}) - f_H(\mathbf{x})\, i$ ,    (1)

we get a complex-valued signal, which is called the analytic signal of $f(\mathbf{x})$.

According to the transfer function of the Hilbert transform, the Fourier
transform of the analytic signal $f_A(\mathbf{x})$ is located in the right
half-space of the frequency domain, i.e. $f_A(\mathbf{x})$ ∘−•
$2 F(\mathbf{u})\, \delta_{-1}(\mathbf{u})$ ($\delta_{-1}$: Heaviside function).

2.2 The 2D Phase Concept 

We want to develop a new 2D analytic signal for intrinsically 1D signals (in
contrast to the 2D analytic signal in [2], which is designed for intrinsically 2D
signals), which shall contain three properties: local amplitude, local phase and
local orientation. Compared to the 1D analytic signal we need one additional
angle. We cannot choose this angle without constraints: if the signal is rotated
by $\pi$, we obtain the same analytic signal, but conjugated. Therefore, we have
the following relationship: negation of the local phase is identical to a rotation
of $\pi$.

Note the difference between direction and orientation in this context; the
direction is a value in $[0; 2\pi)$ and the orientation is a value in $[0; \pi)$.

Any value of the 2D analytic signal can be understood as a 3D vector. The 
amplitude fixes the sphere on which the value is located. The local phase corre- 
sponds to rotations on a great circle on this sphere. To be consistent, a rotation 
of the signal must then correspond to a rotation on a small circle (local orienta- 
tion) . 



The coordinate system defined in this way is displayed in figure 1. It is the
same as in |Hj, but Granlund and Knutsson use the 2D phase in the context of 
orientation adaptive filtering. 







Fig. 1. Coordinate system of the 2D phase approach 






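The angles $\varphi$ and $\theta$ are obtained by
$\theta = \frac{1}{2} \arg((\mathcal{I}\{q\} + \mathcal{J}\{q\}\, i)^2)$ (local
orientation) and
$\varphi = \arg(\mathcal{R}\{q\} - (\mathcal{J}\{q\} - \mathcal{I}\{q\}\, i)\, e^{-i\theta})$
(local phase), where $q \in \mathbb{H}$ with $\mathcal{K}\{q\} = 0$ and
$\arg(z) \in [0; 2\pi)$.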
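The two angle formulas can be tried out numerically. The following is a minimal sketch, assuming the quaternion is given by its real, i- and j-parts as plain floats (the k-part is zero); the function names `arg_0_2pi` and `orientation_and_phase` are hypothetical and not part of the original text.

```python
import cmath
import math

TWO_PI = 2.0 * math.pi


def arg_0_2pi(z: complex) -> float:
    """Argument of z mapped to the interval [0, 2*pi)."""
    return cmath.phase(z) % TWO_PI


def orientation_and_phase(r: float, i_part: float, j_part: float):
    """Return (theta, phi) for q = r + i_part*i + j_part*j with zero k-part."""
    z = complex(i_part, j_part)                # I{q} + J{q} i
    theta = 0.5 * arg_0_2pi(z * z)             # local orientation in [0, pi)
    w = r - complex(j_part, -i_part) * cmath.exp(-1j * theta)
    phi = arg_0_2pi(w)                         # local phase in [0, 2*pi)
    return theta, phi
```

In this sketch, rotating the vector part by $\pi$ (negating both the i- and the j-part) leaves the orientation unchanged and negates the phase modulo $2\pi$, which is the constraint stated in section 2.2.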

Note that this definition of the quaternionic phase is different from that of 
the quaternionic Fourier transform (see e.g. [2]). The reason for this will be
explained in section 



2.3 The Monogenic Signal 

Now, having a phase concept which is rich enough to code all local properties 
of intrinsically ID signals, we construct a generalized Hilbert transform and an 
analytic signal for the 2D case, which make use of the new embedding. 

The following definition of the Riesz transform is motivated by theorem 1,
which establishes a correspondence between the Hilbert transform and the Riesz
transform. The transfer function of the Riesz transform reads

    $H(\mathbf{u}) = \frac{\mathbf{u}}{|\mathbf{u}|}$ ,    (2)

and $F_H(\mathbf{u}) = H(\mathbf{u}) F(\mathbf{u})$ •−∘ $f_H(\mathbf{x})$.



Example: the Riesz transform of $f(\mathbf{x}) = \cos(2\pi \mathbf{u}_0 \cdot \mathbf{x})$ is
$f_H(\mathbf{x}) = -\exp(k\theta_0) \sin(2\pi \mathbf{u}_0 \cdot \mathbf{x})$, where $\theta_0 = \arg(\mathbf{u}_0)$.
Obviously, the Riesz transform yields a function which is identical to the 1D
Hilbert transform of the cosine function, except for an additional rotation in
the i-j plane (the exponential function).
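The role of the exponential factor as a rotation in the i-j plane can be checked with a few lines of quaternion arithmetic. This is only an illustrative sketch: quaternions are stored as tuples (real, i, j, k), and the helper names `qmul` and `rotated_unit_vector` are assumptions introduced here.

```python
import math


def qmul(p, q):
    """Hamilton product of two quaternions given as (real, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def rotated_unit_vector(theta: float):
    """Compute exp(k*theta) * i, which should equal cos(theta) i + sin(theta) j."""
    exp_k_theta = (math.cos(theta), 0.0, 0.0, math.sin(theta))
    i_unit = (0.0, 1.0, 0.0, 0.0)
    return qmul(exp_k_theta, i_unit)
```

Multiplying the resulting unit vector by itself gives -1, so it can take over the role that the imaginary unit plays in the 1D case.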

Up to now, we have only considered a special example, but what about 
general signals? What kind of signals can be treated with this approach? The 
answer can be found easily: the orientation phase must be independent of the 
frequency coordinate. This sounds impossible, but in fact, the orientation phase 
is constant, if the spectrum is located on a line through the origin. 

Signals which have a spectrum of this form are intrinsically ID (i.e. they are 
constant in one direction). This is exactly the class of functions the structure 
tensor has been designed for and we have the following theorem: 

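Theorem 1 Let $f(t)$ be a one-dimensional function with the Hilbert
transform $f_H(t)$. Then, the Riesz transform of the two-dimensional function
$f'(\mathbf{x}) = f((\mathbf{x} \cdot \mathbf{n})\, i)$ reads
$f'_H(\mathbf{x}) = -\mathbf{n}\, i\, f_H((\mathbf{x} \cdot \mathbf{n})\, i)$, where
$\mathbf{n} = \cos(\theta)\, i + \sin(\theta)\, j$ is an arbitrary unit vector.

Now we simply adapt (1) for the 2D case and obtain the monogenic signal of a 2D
signal. Using this definition, we obtain for our example:
$f_A(\mathbf{x}) = \exp\left(\frac{\mathbf{u}_0}{|\mathbf{u}_0|}\, 2\pi\, \mathbf{u}_0 \cdot \mathbf{x}\right)$.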

® Originally, we used the term spherical Hilbert transform in 0. We want to thank 
T. Bülow for alluding to the existence of the Riesz transform and for giving us the
reference ^01 which enabled us to identify the following definition with it. 
Hence, the monogenic signal uses the phase concept, which has been defined
in section 2.2. According to theorem 1, the monogenic signal of an intrinsically
one-dimensional signal $f'(\mathbf{x}) = f((\mathbf{x} \cdot \mathbf{n})\, i)$ reads

    $f'_A(\mathbf{x}) = f((\mathbf{x} \cdot \mathbf{n})\, i) - \mathbf{n}\, f_H((\mathbf{x} \cdot \mathbf{n})\, i)$ .    (3)

Of course, the monogenic signal can be computed for all functions which
are Fourier transformable. However, for signals which do not have an intrinsic
dimension of one, the correspondence to the 1D analytic signal is lost.

Independently of the intrinsic dimensionality of the signal, the analytic signal
can also be calculated in a different way. The 1D analytic signal is obtained in
the Fourier domain by the transfer function $1 + \operatorname{sign}(\mathbf{u})$. For the monogenic
signal we have the same result if we modify the Fourier transform according to
$f(\mathbf{x}) = \int \exp(k\theta/2)\, F(\mathbf{u}) \exp(i 2\pi \mathbf{u} \cdot \mathbf{x}) \exp(-k\theta/2)\, d\mathbf{u}$, the inverse spherical
Fourier transform. Then, we have Ja{x) = f{x) (see ^). 

Since the integrand of $f(\mathbf{x})$ is symmetric, we can also integrate over the half
domain and multiply the integral by two. Therefore, we can use any transfer
function of the form $1 + \operatorname{sign}(\mathbf{u} \cdot \mathbf{n})$ without changing the integral. By simply
omitting half of the data, the redundancy in the representation is removed.

In order to calculate the energy of the monogenic signal, we need the transfer
function, which changes $F(\mathbf{u})$ to $F_A(\mathbf{u})$: it is obtained from (1) and (2) and reads
$1 - \frac{\mathbf{u}}{|\mathbf{u}|}\, i = 1 + \cos(\theta) + \sin(\theta)\, k$. The energy of the monogenic signal is

    $\int |(1 + \cos(\theta) + \sin(\theta)\, k)\, F(\mathbf{u})|^2\, d\mathbf{u} = 2 \int |F(\mathbf{u})|^2\, d\mathbf{u}$ ,    (4)

i.e. it is two times the energy of the original signal.
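The factor of two can be made plausible numerically. The squared magnitude of the transfer function is $(1 + \cos\theta)^2 + \sin^2\theta = 2 + 2\cos\theta$, and the cosine term cancels between $\mathbf{u}$ and $-\mathbf{u}$ because the power spectrum of a real signal is even. The sketch below assumes a made-up, even power spectrum with two Gaussian lobes on an integer frequency grid; the function name `energy_ratio` and the lobe position are hypothetical.

```python
import math


def energy_ratio(half_width: int = 8) -> float:
    """Ratio of monogenic-signal energy to signal energy on a symmetric grid."""
    centre = (3.0, 2.0)  # hypothetical position of the spectral lobes

    def power(u1: float, u2: float) -> float:
        # |F(u)|^2 of a real signal is even: power(u) == power(-u).
        lobe_pos = math.exp(-((u1 - centre[0]) ** 2 + (u2 - centre[1]) ** 2))
        lobe_neg = math.exp(-((u1 + centre[0]) ** 2 + (u2 + centre[1]) ** 2))
        return lobe_pos + lobe_neg

    signal_energy = 0.0
    monogenic_energy = 0.0
    for u1 in range(-half_width, half_width + 1):
        for u2 in range(-half_width, half_width + 1):
            if u1 == 0 and u2 == 0:
                continue  # skip the DC component (result holds for DC-free signals)
            theta = math.atan2(u2, u1)
            # |1 + cos(theta) + sin(theta) k|^2
            gain = (1.0 + math.cos(theta)) ** 2 + math.sin(theta) ** 2
            p = power(u1, u2)
            signal_energy += p
            monogenic_energy += gain * p
    return monogenic_energy / signal_energy
```

The DC term is skipped on purpose, in line with the footnote that the doubling only holds for DC-free signals.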

From the group of similarity transformations (i.e. shifts, rotations and dila- 
tions) only the rotation really affects the monogenic signal; the orientation phase 
is changed according to the rotation. If we interpret the monogenic signal as a 
vector field in 3D (see also section the group of 2D similarity transform^ 
even commutes with the operator that yields the monogenic signal. 

The reader might ask why we use a quaternion-valued spectral approach
which differs from the one of the QFT (see e.g. [2]). The reason is not obvious.
The QFT covers more symmetry concepts than the complex Fourier transform.
The classical transform maps a reflection through the origin onto the conjugation
operator. The QFT maps a reflection in one of the axes onto one of the algebra
automorphisms. We can use $\mathbb{C} \otimes \mathbb{C}$ instead of $\mathbb{H}$ to calculate the QFT. Moreover,

The case of intrinsic dimension zero (i.e. a constant signal) is irrelevant, because the 
Hilbert transform is zero in both cases. 

® This is only valid for DC free signals. The energy of the DC component is not doubled
as in the case of the 1D analytic signal.

® Note that in the context of a 3D embedding, the group of 2D similarity transforms 
is the subgroup of the 3D transforms restricted to shifts in i or j direction, rotations 
around the real axis and a dilation of the i and j axes. 
we have C ® C = (see e.g. 0) and consequently, we can calculate the QFT
of a real signal by two complex transformations using the formula 



and the symmetry wrt. the axes is obvious. 

In this paper, we want to present an isotropic approach which means that 
symmetry wrt. the axes is not sufficient. Therefore, we had to design the new 
transform. The design of isotropic discrete filters is a quite old topic, see e.g. 0. 

3 Properties of the Monogenic Signal 

3.1 The Spatial Representation 

The definition of the Riesz transform in the frequency domain can be transformed 
into a spatial representation. The transfer function O) can be split into two 
functions: ^ and The only thing left is to calculate the inverse Fourier trans- 
form of these functions. In m the transform pairs can be found: 2 t^\x\^ 



The functions $\frac{x_1}{2\pi |\mathbf{x}|^3}$ and $\frac{x_2}{2\pi |\mathbf{x}|^3}$ are the kernels of the 2D Riesz transform in

vector notation. From a mathematician’s point of view, the Riesz transform is 
the multidimensional generalization of the Hilbert transform. Consequently, the 
monogenic signal is directly obtained by the convolution 



The graph in figure 2 sums up all ways to calculate the monogenic signal from
the preceding sections. The inverse spherical Fourier transform is denoted 

3.2 The Structure Multivector 

Normally, images are intrinsically two-dimensional, so the concepts described in 
section 1^1 cannot be applied globally. On the other hand, large areas of images 
are intrinsically one-dimensional, at least on a certain scale. Therefore, a local
processing would take advantage of the new approach.

The classical approach of the analytic signal has its local counterpart in the 
quadrature filters. A pair of quadrature filters (or a complex quadrature filter) is 
characterized by the fact that the impulse response is an analytic signal. On the 
other hand, both impulse responses are band-limited and of finite spatial extent, 
so that the problem of the unlimited impulse response of the Hilbert filter is 
circumvented. 

An example of the output of a 1D quadrature filter can be found in figure 3.



    $F_q(\mathbf{u}) = F(\mathbf{u})\, \frac{1 - k}{2} + F(u_1 i - u_2 j)\, \frac{1 + k}{2}$    (5)
Fig. 3. Upper left: impulse, upper right: magnitude of filter output, lower left: real part
of filter output, lower right: imaginary part of filter output

While the representation in figure 3 is very common, we will introduce now a
different representation. One-dimensional signals can be interpreted as surfaces
in 2D space. If we assign the real axis to the signal values and the imaginary axis
to the abscissa, we obtain a representation in the complex plane. The analytic
signal can be embedded in the same plane - it corresponds to a vector field which
is only non-zero on the imaginary axis (see figure 4).



Fig. 4. Representation of the 1D analytic signal as a vector field

Fig. 5. Siemens-star convolved with a spherical quadrature filter (magnitude)



Based on the monogenic signal, we introduce the spherical quadrature filters.
They are defined according to the 1D case as a hypercomplex filter whose impulse
response is a monogenic signal.

It is remarkable that the spherical quadrature filters have isotropic energy
and exactly choose the frequency bands they are designed for. In figure 5 it
can be seen that the energy is isotropic and that it is maximal for the radius
77.8, which corresponds to a frequency of The used bandpass has a center 
frequency of -h. 
Fig. 6. From left to right: impulse-line; filter output: amplitude, real part, combined
imaginary parts



In figure 6 the output of a spherical quadrature filter applied to an
impulse-line is displayed. The lower right image shows the combined imaginary
parts, which means that instead of the imaginary unit i the unit vector
$\mathbf{n}$ is used.

As for 1D signals, 2D signals can be embedded as a surface in 3D
space. The signal values are assigned to the real axis and the spatial coordinates
to the i- and the j-axis. The monogenic signal can be represented in the same
embedding. It corresponds to a vector field, which is only non-zero in the plane
spanned by i and j (see figure 7).




Fig. 7. Representation of the monogenic signal as a vector field 



The result of filtering a signal with a spherical quadrature filter is a
quaternion-valued field. Though the k-component (bivector) of the field is always
zero, we denote this field as a multivector field or the structure multivector of
the signal. As the name already suggests, the structure multivector is closely related
to the structure tensor. The structure tensor as defined in jS| mainly includes 
the following information: the amplitude as a measurement for the existence of 
local structure and the orientation of the local structure. 
Jähne |H1 extracts additional information: the coherence. The coherence
is the relationship between the oriented gradients and all gradients, so it is a 
measurement for the degree of orientation in a structure and it is closely related 
to the variance of the orientation. The variance is a second order property. It 
includes a product of the arguments and therefore, it is not linear. Consequently, 
the coherence cannot be measured by a linear approach like the structure mul- 
tivector. Two structures with different orientations simply yield the vector sum 
of both multi vector fields. 

The structure multivector consists of three independent components (local
phase, local orientation and local amplitude) and it codes three properties.
Consequently, it cannot carry any additional information. The structure tensor
also possesses three degrees of freedom (it is a symmetric tensor). Therefore,
apart from the amplitude and the orientation one can extract a third piece of
information, the coherence.

4 Experiments and Discussion 

4.1 Experiments 

For the computation of the structure multivector we use a multi-scale approach, 
i.e. we couple the shift of the Gaussian bandpass with the variance as for the 
Gabor wavelets. 

For the experiments, we chose some synthetic examples with letters as gray 
level or textured images. 




(a) (b) 

Fig. 8. (a) Structure multivector of an image without texture. Upper left: original,
upper right: amplitude, lower left: $\varphi$-phase, lower right: $\theta$-phase.

(b) Structure multivector of an image with one texture. Upper left: original, upper
right: amplitude, lower left: $\varphi$-phase, lower right: $\theta$-phase



In figure 8(a) it can be seen that the structure multivector responds only at
the edges. Therefore, the amplitude is a measure for the presence of structure.
The $\varphi$-phase is linear, which can only be guessed in this representation. But it
can be seen that the $\varphi$-phase is monotonic modulo a maximal interval (which
is in fact $2\pi$). The $\theta$-phase represents the orientation of the edge. Note that the
highest and the lowest gray level (standing for $\pi$ and zero, respectively) represent
the same orientation.

In figure 8(b) it can be seen that the structure multivector responds also
inside the object. The amplitude is nearly constant, which corresponds to a
texture with constant energy. Of course, this property is lost if the wrong scale is
considered. The $\varphi$-phase is linear, see notes above. The $\theta$-phase represents the
orientation of the texture. A constant gray level corresponds to a constant
estimated orientation. The small spikes in the figure are produced by the extraction
of the local orientation angle. The underlying quaternion-valued field does not
show these artifacts.




(a) (b) 



Fig. 9. (a) Structure multivector of an image with two superposed textures. Upper
left: original, upper right: amplitude, lower left: $\varphi$-phase, lower right: $\theta$-phase.

(b) Structure multivector of an image with two superposed textures and textured
background. Upper left: original, upper right: amplitude, lower left: $\varphi$-phase, lower
right: $\theta$-phase



In figure 9(a) it can be seen that the structure multivector responds only with
respect to the dominant texture (the one with higher frequency). The magnitude
of the response is modulated with that component of the weaker texture that is
normal to the dominant texture. This effect is even more obvious in figure 9(b).
The $\varphi$-phase is always directed parallel to the dominant texture. The $\theta$-phase
represents the orientation of the dominant texture in each case.

4.2 Conclusion 

We have presented a new approach to the 2D analytic signal: the monogenic
signal. It has an isotropic energy distribution and deploys the same local phase
approach as the 1D analytic signal. The orientation has no impact on the local
phase; such an impact is one of the most important drawbacks of the classical 2D
analytic signal.

Additionally, the monogenic signal includes information about the local ori- 
entation and therefore, it is related to the structure tensor. On the other hand, 
there are two differences compared to the latter approach: the structure multi- 
vector does not include coherence information and it is linear. 

The local counterpart to the monogenic signal is the structure multivector. 
The latter is the response of the spherical quadrature filters which are a
generalization of the 1D quadrature filters. From the structure multivector one obtains
a stable orientation estimation (as stable as the orientation vector field of the 
structure tensor). In contrast to the classical quadrature filters, the spherical 
quadrature filters do not have a preference direction. Therefore, the orientation 
need not be sampled or steered. 

We introduced an interpretation technique for the analytic and the monogenic 
signal in form of vector fields. Furthermore, we tried to explain the impact of 
simple structures and textures on the structure multivector. Applications can 
easily be designed, e.g. texture segmentation (see also [2]).
Abstract. We consider the potentialities of matching multiple views of a
3D scene by the least square correlation provided that relative projective
geometric distortions of the images are affinely approximated. The affine
transformation yielding the (sub)optimal match is obtained by combining
an exhaustive and directed search in the parameter space. The directed
search is performed by a proposed modification of the Hooke-Jeeves
unconstrained optimization. Experiments with the RADIUS multiple-view
images of a model board show the feasibility of this approach.



1 Introduction 

Generally, the uncalibrated multiple- view 3D scene reconstruction involves a set 
of images with significant relative geometric distortions (because of different 
exterior and interior parameters of cameras used for image acquisition). This
complicates the search for initial stereo correspondences for starting an iterative 
process of simultaneous cameras calibration and 3D surface recovery ISEE). If 
the images form a sequence such that each neighbouring pair has rather small 
geometric deviations, then the search for correspondences is usually reduced to 
detection of identical points-of-interest (POI) such as corners ra . But generally 
due to significant geometric and photometric image distortions, the identical 
POIs may not be simultaneously detected in different images. Therefore it is 
more reliable to directly match large image areas by taking account of possible 
relative distortions. 

We restrict our consideration to the simplified case when image distortions 
can be closely approximated by affine transformations ffiflOj . Then the least 
square correlation [2,3,4] can be used for finding a transformation that yields
the largest cross-correlation of the images. 

The least square correlation is widely used in computational binocular stereo
if relative geometric distortions in a stereo pair are comparatively small [3,4]. In
this case, although the correlation function is generally multimodal, the gradient
(steepest ascent) search is used to find the maximum correlation [2]. Such a
search is based on normal equations obtained by linear approximation of the 
cross-correlation function in the vicinity of a starting point in the space of affine 
parameters. 



R. Klette et al. (Eds.): Multi-Image Analysis, LNCS 2032, pp. 105-^^^ 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001
The straightforward gradient search is not workable in the multiple-view case
because of larger relative distortions of the images. The globally optimum match 
can be, in principle, found by exhausting all the values of affine parameters in a 
given range of possible distortions. But this is not computationally feasible. 

In this paper we consider a more practical (but only suboptimal) approach
combining an exhaustion of some affine parameters over a sparse grid of their
values with a directed search for all the parameters starting from every grid
position. The directed search is based on a modified Hooke-Jeeves unconstrained
optimization. The proposed modification is intended to take account of the
multi-modality of cross-correlation. Feasibility of the proposed approach is
illustrated by experiments with the RADIUS multiple-view images of a 3D model
board scene jOj. 



2 Basic Notation 



Let $R_j$ be a finite arithmetic lattice supporting a greyscale image $g_j : R_j \to G$
where G is a finite set of grey values. Let $(x, y) \in R_j$ denote a pixel with the
column coordinate x and row coordinate y. For simplicity, the origin (0, 0) of the
(x, y)-coordinates is assumed to coincide with the lattice centre.

Let $g_1$ be a rectangular prototype matched in the image $g_2$ to a quadrangular
area specified by an affine transformation $\mathbf{a} = [a_1, \ldots, a_6]$. The transformation
relates each pixel (x, y) in the prototype $g_1$ to the point $(x_a, y_a)$ in the image $g_2$:

    $x_a = a_1 x + a_2 y + a_3$;
    $y_a = a_4 x + a_5 y + a_6$.    (1)

The affine parameters $(a_1, a_5)$, $(a_2, a_4)$, and $(a_3, a_6)$ describe, respectively, the
x- and y-scaling, shearing, and shifting of $g_2$ with respect to $g_1$.

Grey levels $g_2(x_a, y_a)$ in the points with non-integer coordinates $(x_a, y_a)$ are
found by interpolating grey values in the neighbouring pixels of the lattice $R_2$. If
the transformed point $(x_a, y_a)$ falls outside of the lattice, then the original pixel
$(x, y) \in R_1$ is assumed to be excluded from matching.
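The affine mapping and the grey-level interpolation can be illustrated as follows. This is a sketch under stated assumptions: the text does not say which interpolation is used, so bilinear interpolation is assumed here; the image is a list of rows; the lattice centre is taken at ((width - 1) / 2, (height - 1) / 2); and the names `affine_map` and `sample` are hypothetical.

```python
import math


def affine_map(a, x, y):
    """Map the pixel (x, y) to (x_a, y_a) with parameters a = [a1, ..., a6]."""
    a1, a2, a3, a4, a5, a6 = a
    return a1 * x + a2 * y + a3, a4 * x + a5 * y + a6


def sample(image, xa, ya):
    """Bilinearly interpolated grey level at centre-based coordinates (xa, ya).

    Returns None if the point falls outside the lattice, which signals that
    the originating pixel is excluded from matching.
    """
    height, width = len(image), len(image[0])
    col = xa + (width - 1) / 2.0   # shift the origin from centre to corner
    row = ya + (height - 1) / 2.0
    if col < 0 or row < 0 or col > width - 1 or row > height - 1:
        return None
    c0, r0 = int(math.floor(col)), int(math.floor(row))
    c1, r1 = min(c0 + 1, width - 1), min(r0 + 1, height - 1)
    fc, fr = col - c0, row - r0
    top = (1 - fc) * image[r0][c0] + fc * image[r0][c1]
    bottom = (1 - fc) * image[r1][c0] + fc * image[r1][c1]
    return (1 - fr) * top + fr * bottom
```

Returning `None` for points outside the lattice keeps the exclusion rule explicit, so a caller can build the sublattice of pixels that actually take part in the matching.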

The least square cross-correlation

    $C(\mathbf{a}^*) = \max_{\mathbf{a}} \{ C(\mathbf{a}) \}$    (2)

maximizes by the affine parameters $\mathbf{a}$ the conventional cross-correlation

    $C(\mathbf{a}) = \sum_{(x,y) \in R_{1,a}} \frac{g_1(x,y) - m_1}{s_1} \cdot \frac{g_2(x_a, y_a) - m_{2,a}}{s_{2,a}}$    (3)

between the prototype $g_1$ and the affinely transformed image $g_2$. Here, m and s are
the mean values and standard deviations, respectively:

    $m_1 = \frac{1}{|R_{1,a}|} \sum_{(x,y) \in R_{1,a}} g_1(x,y)$;    $s_1^2 = \sum_{(x,y) \in R_{1,a}} (g_1(x,y) - m_1)^2$;

    $m_{2,a} = \frac{1}{|R_{1,a}|} \sum_{(x,y) \in R_{1,a}} g_2(x_a, y_a)$;    $s_{2,a}^2 = \sum_{(x,y) \in R_{1,a}} (g_2(x_a, y_a) - m_{2,a})^2$

and $R_{1,a} = \{(x, y) : ((x, y) \in R_1) \wedge ((x_a, y_a) \in R_2)\}$ denotes the sublattice
which actually takes part in matching the prototype $g_1$ and the affinely
transformed image $g_2$.
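As a small illustration of Eq. (3), the correlation can be computed from two equally long lists of grey values that have already been paired over the sublattice $R_{1,a}$. The function name `cross_correlation` and the convention of returning 0.0 for a constant patch are assumptions made for this sketch only.

```python
import math


def cross_correlation(g1_values, g2_values):
    """Normalised cross-correlation of paired grey values, as in Eq. (3)."""
    n = len(g1_values)
    m1 = sum(g1_values) / n
    m2 = sum(g2_values) / n
    # s is the root of the summed squared deviations, so C lies in [-1, 1].
    s1 = math.sqrt(sum((v - m1) ** 2 for v in g1_values))
    s2 = math.sqrt(sum((v - m2) ** 2 for v in g2_values))
    if s1 == 0.0 or s2 == 0.0:
        return 0.0  # assumed convention for a constant patch
    total = sum((p - m1) * (q - m2) for p, q in zip(g1_values, g2_values))
    return total / (s1 * s2)
```

Because both patches are centred and normalised, the value does not change under a gain and offset change of either image, which is why the measure tolerates moderate photometric differences between views.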



3 Combined Search for Suboptimal Affine Parameters



To approach the least square correlation in Eq. we use the following combined 
exhaustive and directed search in the affine parameter space. For each given
prototype $g_1$, a sparse grid of the relative shifts $a_3^{[0]}$ and $a_6^{[0]}$ of the matching
area in the image $g_2$ is exhausted. Starting from each grid position, the modified
Hooke-Jeeves directed optimization is used to maximize the cross-correlation
$C(\mathbf{a})$ by all six affine parameters. The largest correlation over the grid provides
the desired affine parameters $\mathbf{a}^*$ of the (sub)optimal match.

The modified Hooke-Jeeves optimization consists of the following two
successive stages which are repeated iteratively while the correlation value $C(\mathbf{a})$
continues to increase. Each parameter $a_i$, i = 1, ..., 6, varies in a given
range $[a_{i,\min}, a_{i,\max}]$, and the search starts with the initial parameter values
$\mathbf{a}^{[0]} = [1, 0, a_3^{[0]}, 0, 1, a_6^{[0]}]$.

1. Exploration stage. At each step t = 1, 2, ..., T, the locally best parameter,
$i_t$, is chosen by changing each parameter $i \in \{1, \ldots, 6\}$ under the fixed
values, $[a_k^{[t-1]} : k \neq i;\ k \in \{1, \ldots, 6\}]$, of other parameters. The choice yields the
largest increase of the correlation $C(\mathbf{a}^{[t]})$ with respect to $C(\mathbf{a}^{[t-1]})$ providing
the parameters $\mathbf{a}^{[t]}$ and $\mathbf{a}^{[t-1]}$ differ by only the value of the locally best
parameter $i_t$. The exploration steps are repeated while the cross-correlation
$C(\mathbf{a}^{[t]})$ increases further.

2. Search stage. The affine parameters $\mathbf{a}_\lambda = \mathbf{a}^{[T]} + \lambda \mathbf{d}$ are changed in the
conjectured direction $\mathbf{d} = \mathbf{a}^{[T]} - \mathbf{a}^{[0]}$ of the steepest increase while the
cross-correlation $C(\mathbf{a}_\lambda)$ increases further.



Each exploration step exhausts a given number L of the equispaced parameter
values in their range to approach the local correlation maximum along a
parameter axis, given the fixed previous values of all other parameters. In the
experiments below L = 3, ..., 15. The quadratic approximation of these L
correlations provides another possible position of the local maximum. The maximum
of the L + 1 values found is then locally refined using small increments $\pm\delta_i$ of
the parameter.

The exploration steps converge to a final local maximum value $C(\mathbf{a}^{[T]})$, and
the parameters $\mathbf{a}^{[T]}$ allow for inferring the possible steepest ascent direction in
the parameter space. The search along that direction refines further the obtained
least square correlation.
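One exploration step along a single parameter axis might be sketched as follows. This is not the authors' implementation: the quadratic approximation is simplified here to a parabola through the best sample and its two neighbours (the text speaks of an approximation of all L correlations), and the names `explore_axis`, `score`, `delta` and `max_refine` are hypothetical.

```python
def explore_axis(score, lo, hi, num_samples, delta, max_refine=1000):
    """Search one parameter axis: L equispaced samples, a quadratic
    candidate, then local refinement by +/- delta. `score` maps a
    parameter value to the correlation, all other parameters being fixed.
    Requires num_samples >= 2."""
    step = (hi - lo) / (num_samples - 1)
    values = [lo + k * step for k in range(num_samples)]
    scores = [score(v) for v in values]
    k = max(range(num_samples), key=lambda idx: scores[idx])
    best_v, best_s = values[k], scores[k]

    # Quadratic candidate: vertex of the parabola through three samples
    # (a simplification assumed for this sketch).
    if 0 < k < num_samples - 1:
        left, mid, right = scores[k - 1], scores[k], scores[k + 1]
        curvature = left - 2.0 * mid + right
        if curvature < 0.0:
            candidate = values[k] + 0.5 * step * (left - right) / curvature
            cand_s = score(candidate)
            if cand_s > best_s:
                best_v, best_s = candidate, cand_s

    # Local refinement with small increments, staying inside the range.
    for _ in range(max_refine):
        improved = False
        for trial in (best_v - delta, best_v + delta):
            if lo <= trial <= hi:
                s = score(trial)
                if s > best_s:
                    best_v, best_s, improved = trial, s, True
        if not improved:
            break
    return best_v, best_s
```

Sampling the whole range before refining is what lets the step skip minor modes that a purely local coordinate search would get stuck in, at the price of L extra correlation evaluations per axis.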

The proposed algorithm replaces the coordinate-wise local search for the
closest correlation maximum of the original Hooke-Jeeves exploration stage with
the combined exhaustive and directed search. This makes it possible to roughly
take into account the multimodal character of the cross-correlation function in
the parameter space and to escape some non-characteristic minor modes. Figure 1
shows the typical multi-modal dependence of the cross-correlation on a single
affine parameter, given the fixed values of other affine parameters.

Fig. 1. Typical cross-correlations at the exploration steps.

Compared to the conventional least square matching [2], our algorithm does
not linearize the correlation function in the parameter space and hence does not
build and use the normal equation matrix. The latter is usually ill-conditioned
because it depends on the image derivatives with respect to affine parameters.



4 Experiments with the RADIUS Images 

Image pairs M15-M28, M24-M25, and M29-M30 selected for experiments from
the RADIUS-M set are shown in Figure 2. The images of size 122 x 96 and
244 x 192 represent, respectively, the top and the next-to-top levels of image
pyramids. Each pyramid is built by reducing the original image 1350 x 1035 to
976 x 768 at the first level and then by the twofold demagnification at each next
level of the pyramid.
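The level sizes quoted above follow from repeated halving of the 976 x 768 first level. A tiny sketch, where the function name `pyramid_sizes` and the assumption of four levels in total are ours:

```python
def pyramid_sizes(width, height, levels):
    """Sizes of the pyramid levels, halving both dimensions at each level."""
    sizes = [(width, height)]
    for _ in range(levels - 1):
        width, height = width // 2, height // 2
        sizes.append((width, height))
    return sizes
```

With four levels, the last two entries are 244 x 192 and 122 x 96, matching the next-to-top and top levels mentioned in the text.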

Some results of matching the top-level image pairs in Figure 2 using the
rectangular prototype windows of size 49 x 81 are shown in Table 1 and Figure 3.
The prototype $g_1$ is placed at the central position (61, 48) in the initial image. In
these experiments, the search grid 5 x 5 of step 5 in both directions is centered
at the same position (61, 48) in the other image (that is, the shift parameters
for the central grid point are $a_3^{[0]} = 0$ and $a_6^{[0]} = 0$), and the parameter L = 11.
Fig. 2. RADIUS images M15-M28 (a), M24-M25 (b), and M29-M30 (c) at the top
and next-to-top pyramid level.

Fig. 3. Initial top-level RADIUS images M15 (a), M28 (c), M24 (e), M25 (g), M29
(i), M30 (k) and the affinely transformed images M28 (b), M15 (d), M25 (f), M24 (h),
M30 (j), M29 (l) adjusted to M15, M28, M24, M25, M29, M30, respectively, with the
parameters presented in Table 1.
Table 1. The least square correlation values and the corresponding
affine parameters obtained by matching the top-level RADIUS images
M15-M28, M24-M25, and M29-M30 in Figure 2.

                          Affine parameters a*
Transformation   C(a*)    a1*     a2*     a3*     a4*     a5*     a6*
M28 to M15       0.62    1.03    0.35    18.4   -0.50    1.05     5.6
M15 to M28       0.52    0.90   -0.30   -11.0    0.30    0.70    -7.0
M25 to M24       0.66    0.90    0.00     2.0    0.00    1.00   -10.8
M24 to M25       0.79    1.12   -0.02    -3.2   -0.02    0.90     7.5
M30 to M29       0.69    1.00    0.00     2.0   -0.01    0.77     3.0
M29 to M30       0.64    0.97    0.00    -2.0    0.00    1.23    -5.0




Fig. 4. Next-to-top image M28 affinely adjusted to M15 using the affine parameters
in Table 2. The dark rectangles show positions of the prototype windows and the
grey-coded values of the residual pixel-wise errors for the least square correlation matching.
Table 2. Characteristics of the prototype windows to be matched, the least square
correlation values, and the affine parameters found by matching.

        Window   Position  Position  Search grid                 Affine parameters a*
Fig. 4  size     in g1     in g2     size   step   L   C(a*)    a1*    a2*    a3*     a4*    a5*    a6*
a       50 x 35   85,55     90,50    5 x 5   10    3   0.67    1.11   0.50   11.0   -0.50   1.23   14.0
b       50 x 35   94,53     97,57    5 x 5   10    3   0.69    1.10   0.50   10.0   -0.50   1.24    9.0
c       50 x 35   90,50     95,55    5 x 5   10    3   0.69    1.09   0.50    8.3   -0.50   1.24   10.0
d       50 x 35   80,65    100,65    5 x 5   10    3   0.69    1.09   0.50   15.2   -0.50   1.20   19.0
e       50 x 35   94,53    105,65    5 x 5   10    3   0.70    1.09   0.50   10.0   -0.50   1.24    8.9
f       50 x 35   97,56    109,69    5 x 5   10    3   0.71    1.08   0.50   11.8   -0.50   1.25    8.0
g       50 x 35  100,59    110,70    5 x 5   10   15   0.72    1.10   0.50   14.0   -0.50   1.21    8.0
h       50 x 35   99,58    111,71    5 x 5   10    3   0.72    1.10   0.50   13.0   -0.50   1.23    7.8
i       50 x 35  100,59    110,70    3 x 3   10    7   0.73    1.13   0.50   14.0   -0.54   1.28    7.7
j       50 x 50  100,59    110,70    1 x 1   10   11   0.73    1.10   0.50   14.0   -0.50   1.24    7.0
k       50 x 35  100,59    110,70    5 x 5   10    3   0.73    1.10   0.50   14.0   -0.50   1.26    7.4
l       50 x 35   99,58    110,70    5 x 5   10    3   0.73    1.11   0.50   13.4   -0.50   1.22    7.9



Table 3. Characteristics of the prototypes to be matched, the least square correlation
values, and the affine parameters found by matching.

        Window   Position  Position  Search grid                 Affine parameters a*
Fig.    size     in g1     in g2     size   step   L   C(a*)    a1*    a2*     a3*     a4*    a5*    a6*
a       50 x 35  100,60    100,60    5 x 5   15    5   0.49    0.54   0.06    31.0    0.21   0.86   -2.5
b       50 x 50  110,70    100,60    5 x 5   10    3   0.52    0.75  -0.28   -12.0    0.12   1.00   -4.0
c       50 x 50  100,65    100,65    5 x 5   10   15   0.53    0.83  -0.29    -7.0    0.14   0.97   -6.3
d       75 x 50  100,65    100,65    5 x 5   10   15   0.53    0.83  -0.29    -7.0    0.14   0.97   -6.3
e       50 x 35  100,60    100,60    5 x 5   10   15   0.68    0.75  -0.36   -10.0    0.32   0.66  -10.0
f       50 x 50  100,60    100,60    5 x 5   10   15   0.69    0.74  -0.32    -9.0    0.29   0.68  -10.0
g       50 x 50  110,70    100,60    5 x 5   10   11   0.71    0.77  -0.30   -14.0    0.30   0.70  -10.0
h       50 x 50  110,70    100,60    5 x 5   10    5   0.71    0.75  -0.30   -14.0    0.31   0.69  -10.0
i       50 x 35  100,65    100,65    5 x 5   10   15   0.71    0.75  -0.29   -10.0    0.32   0.64  -11.0
j       50 x 50  100,59    110,70    1 x 1   10   11   0.72    0.76  -0.28   -14.0    0.33   0.67  -10.3
k       50 x 35  110,70    110,70    5 x 5   10    3   0.72    0.77  -0.31   -14.0    0.32   0.68  -10.0
l       50 x 35  110,70    100,60    5 x 5   15    5   0.72    0.76  -0.31   -14.0    0.32   0.68  -10.0



In all our experiments the ranges of the affine parameters $a_1, a_5$ and $a_2, a_4$
are [0.5, 1.5] and [-0.5, 0.5], respectively. The ranges of the parameters $a_3$ and
$a_6$ are given by the width and height of the chosen prototype window.
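
The role of the six affine parameters can be illustrated with a minimal sketch. It assumes the mapping x' = a1 x + a2 y + a3, y' = a4 x + a5 y + a6; this ordering is an assumption, chosen because it agrees with the ranges stated above and with the values listed in the tables. Reading "given by the width and height" as |a3| <= width and |a6| <= height is likewise an assumption, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed parameter order (a1, a2, a3, a4, a5, a6); not spelled out in the text.
SCALE_RANGE = (0.5, 1.5)    # assumed to apply to a1 and a5
SHEAR_RANGE = (-0.5, 0.5)   # assumed to apply to a2 and a4


def affine_map(a: Sequence[float], x: float, y: float) -> Tuple[float, float]:
    """Map a point (x, y) with the assumed six-parameter affine model."""
    a1, a2, a3, a4, a5, a6 = a
    return a1 * x + a2 * y + a3, a4 * x + a5 * y + a6


def in_search_range(a: Sequence[float], width: float, height: float) -> bool:
    """Check a parameter vector against the search ranges.

    The shift limits |a3| <= width and |a6| <= height are one possible
    reading of the text, not a documented rule.
    """
    a1, a2, a3, a4, a5, a6 = a
    return (
        SCALE_RANGE[0] <= a1 <= SCALE_RANGE[1]
        and SCALE_RANGE[0] <= a5 <= SCALE_RANGE[1]
        and SHEAR_RANGE[0] <= a2 <= SHEAR_RANGE[1]
        and SHEAR_RANGE[0] <= a4 <= SHEAR_RANGE[1]
        and abs(a3) <= width
        and abs(a6) <= height
    )
```

For a 50 x 35 prototype window, a vector such as (1.10, 0.50, 14.0, -0.50, 1.24, 7.9) lies inside these ranges, with two of its components sitting exactly on the shear limits.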

In these cases, the least square correlation matching allows us, at least as a first
approximation, to relatively orient all three image pairs. For comparison,
Table 4 presents results of matching the top-level images M15 and M28 using
the two larger prototype windows of size 71 x 51 and 81 x 61 placed at the central
position (60, 50) in M15. Here, the search grid 5 x 5 of step 10 is sequentially
centered at the nine neighbouring positions (60 ± 1, 50 ± 1) in M28. Although
the photometric distortions of the images are non-uniform, the median values
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Fig. 5. Next-to-top image M15 affinely adjusted to M28 using the affine parameters 
in Table 2. The dark rectangles show positions of the prototypes and the grey-coded
values of the residual pixel-wise errors for the least square correlation matching. 



of the obtained affine parameters for the confident matches with
$C(a^*) > 0.55$ are quite similar to the like parameters in Table 2.

Tables 0- Eland Figures 0- El show results of matching the images M15 and 
M28 on the next-to-top level of the pyramids. Here, different prototype windows 
and various search grids and approximation orders are compared. The position 
of the prototype window with respect to the image is shown by a dark rectangle 
giving the grey-coded residual pixel-wise errors of matching (the darker the 
pixel, the smaller the error). 

The matching results are mostly similar although in the general case they 
depend on the search characteristics, in particular, on the chosen search grid 
and the parameter L (e.g., Figures Elaib,e,g, and the corresponding data in 
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Table 4. Central positions of the search grid in M28, the least square correlation 
values, and the affine parameters found by matching M28 to M15. 



Window   Position  C(a*)  Affine parameters a*
size     in M28           a1*    a2*    a3*    a4*    a5*    a6*
71 x 51  59,49     0.66   1.10   0.50   18.0   -0.50  1.21   3.0
         60,49     0.59   1.10   0.40   18.0   -0.50  1.10   5.0
         61,49     0.59   1.10   0.40   18.0   -0.50  1.10   5.0
         59,50     0.67   1.09   0.48   18.0   -0.50  1.20   3.0
         60,50     0.67   1.09   0.48   18.0   -0.50  1.20   3.0
         61,50     0.67   1.09   0.48   18.0   -0.50  1.20   3.0
         59,51     0.54   1.10   0.30   19.0   -0.50  1.00   7.0
         60,51     0.52   1.00   0.30   20.0   -0.56  1.00   7.0
         61,51     0.53   1.00   0.40   20.0   -0.56  1.28   4.0
Median parameter values for the confident matches:
                          1.09   0.48   18.0   -0.50  1.20   4.0
81 x 61  59,49     0.55   1.02   0.50   16.0   -0.40  1.10   6.0
         60,49     0.58   1.10   0.44   18.0   -0.50  1.36   5.0
         61,49     0.65   1.10   0.44   18.0   -0.50  1.20   3.0
         59,50     0.55   1.02   0.50   16.0   -0.40  1.10   6.0
         60,50     0.58   1.10   0.44   18.0   -0.50  1.36   5.0
         61,50     0.65   1.10   0.44   18.0   -0.50  1.20   3.0
         59,51     0.55   1.02   0.50   16.0   -0.40  1.10   6.0
         60,51     0.57   1.00   0.50   15.4   -0.50  1.20   4.0
         61,51     0.60   1.05   0.50   16.0   -0.50  1.20   4.0
Median parameter values for the confident matches:
                          1.10   0.50   16.0   -0.50  1.20   5.0



Table E|) • Also, the larger prototype windows may affect the precision of the 
affine approximation of actually projective image distortions (Figures E]c,d,i). 

The median values of the obtained affine parameters $[a_1^*, \ldots, a_6^*]$ for the seven
best matches are as follows:

M15-M28:   1.10   0.50   14.0   -0.50  1.24   7.9
M28-M15:   0.75   -0.30  -14.0  0.32   0.68   -10.0



These values are close to the parameters found by matching the top-level images 
so that the fast top-level matching can provide a first approximation of the 
relative geometric distortions of these images to be refined at the next levels 
of the image pyramids. Similar results are also obtained for the image pairs 
M24-M25 and M29-M30 as well as for other RADIUS images (e.g., M24-M10,
M10-M11, M11-M19, M19-M20, M8-M9, M9-M23, M23-M29, M30-M36, etc.).
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5 Concluding Remarks 

These and other experiments show that the proposed modified Hooke-Jeeves
optimisation algorithm permits us to successfully match large-size areas in the
multiple-view images of a 3D scene by the least square correlation, provided
the relative image distortions can be affinely approximated. This approach has
a moderate computational complexity, hence in principle it can be used at the
initial stage of the uncalibrated multiple-view terrain reconstruction.

The approach exploits almost no prior information about a 3D scene, except
for the ranges of the affine parameters for matching. Also, the final
cross-correlation value provides a confidence measure for the obtained results: if the
correlation is less than or equal to 0.5-0.55, one may conclude that the matching
fails; otherwise, the larger the correlation, the higher the confidence and, in
most cases, the better the affine approximation of the relative image distortions.
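
For orientation, the classical Hooke-Jeeves pattern search can be sketched as follows. This is the textbook variant with box constraints, not the modified algorithm used in the experiments, whose details are not reproduced in this section. The function name, the step-halving rule and the stopping threshold are assumptions made for illustration. To maximise a correlation C(a), one would minimise -C(a).

```python
from typing import Callable, List, Sequence, Tuple


def hooke_jeeves(
    f: Callable[[List[float]], float],
    x0: Sequence[float],
    step: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
    min_step: float = 1e-3,
    shrink: float = 0.5,
) -> Tuple[List[float], float]:
    """Minimise f inside a box by classical Hooke-Jeeves pattern search."""

    def clip(x: Sequence[float]) -> List[float]:
        return [min(max(v, lo), hi) for v, lo, hi in zip(x, lower, upper)]

    def explore(start: List[float], f_start: float, h: List[float]):
        # Exploratory moves: try +h and -h along every coordinate in turn.
        x, fx = list(start), f_start
        for i in range(len(x)):
            for delta in (h[i], -h[i]):
                trial = list(x)
                trial[i] += delta
                trial = clip(trial)
                f_trial = f(trial)
                if f_trial < fx:
                    x, fx = trial, f_trial
                    break
        return x, fx

    base = clip(x0)
    f_base = f(base)
    h = list(step)
    while max(h) > min_step:
        x, fx = explore(base, f_base, h)
        if fx < f_base:
            # Pattern moves: keep jumping along the successful direction.
            while True:
                pattern = clip([2 * xi - bi for xi, bi in zip(x, base)])
                base, f_base = x, fx
                x, fx = explore(pattern, f(pattern), h)
                if fx >= f_base:
                    break
        else:
            # No improvement around the base point: refine the step sizes.
            h = [hi * shrink for hi in h]
    return base, f_base
```

The box bounds play the role of the parameter ranges mentioned above; the search never evaluates the objective outside them.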
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Abstract. In this article we want to introduce first the Gabor wavelet 
network as a model-based approach for an effective and efficient object
representation. The Gabor wavelet network has several advantages such
as invariance to some degree with respect to translation, rotation and
dilation. Furthermore, the use of Gabor filters ensures that geometrical
and textural object features are encoded. The feasibility of the Gabor
filters as a model for local object features ensures a considerable data
reduction while at the same time allowing any desired precision of the
object representation ranging from a sparse to a photo-realistic
representation. In the second part of the paper we will present an approach
for the estimation of a head pose that is based on the Gabor wavelet
networks.



1 Introduction 

Recently, model-based approaches for the recognition and the interpretation of 
images of variable objects, like the bunch graph approach, PCA, eigenfaces and 
active appearance models, have received considerable interest mi El 0 nm 
These approaches achieve good results because solutions are constrained to be 
valid instances of a model. In these approaches, the term “model-based” is un- 
derstood in the sense that a set of training objects is given in the form of gray 
value pixel images while the model “learns” the variances of the gray values 
(PCA, eigenfaces) or, respectively, the Gabor filter responses (bunch graph). 
With this, model knowledge is given by the variances of pixel gray values, which 
means that the actual knowledge representation is given on a pixel basis, that 
is independent from the objects themselves. 

In this work we want to introduce a novel approach for object representation 
that is based on Gabor Wavelet Networks. Gabor Wavelet Networks (GWN)
combine the advantages of RBF networks with the advantages of Gabor
wavelets: GWNs represent an object as a linear combination of Gabor wavelets
where the parameters of each of the Gabor functions (such as orientation,
position and scale) are optimized to reflect the particular local image structure.
Gabor wavelet networks have several advantages: 
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1. By their very nature, Gabor wavelet networks are invariant to some degree 
to affine deformations and homogeneous illumination changes, 

2. Gabor filters are good feature detectors EOIEII and the optimized param- 
eters of each of the Gabor wavelets are directly related to the underlying 
image structure, 

3. the weights of each of the Gabor wavelets are directly related to their filter
responses and with that they are also directly related to the underlying local
image structure,

4. the precision of the representation can be varied to any desired degree,
ranging from a coarse representation to an almost photo-realistic one, by simply
varying the number of used wavelets.

We will discuss each single point in section 2.

The use of Gabor filters implies a model for the actual representation of the 
object information. In fact, as we will see, the GWN represents object informa- 
tion as a set of local image features, which leads to a higher level of abstraction 
and to a considerable data reduction. Both, textural and geometrical information 
is encoded at the same time, but can be split to some degree. 

The variability in precision and the data reduction are the most important
advantages in this context, with several consequences:

1 . Because the parameters of the Gabor wavelets and the weights of the network 
are directly related to the structure of the training image and the Gabor filter 
responses, a GWN can be seen as a task oriented optimal filter bank: given 
the number of filters, a GWN defines that set of filters that extracts the 
maximal possible image information. 

2. For real-time applications one wants to keep the number of filtrations low 
to save computational resources and it makes sense in this context to relate 
the number of filtrations to the amount of image information really needed 
for a specific task: In this sense, it is possible to relate the representation 
precision to the specific task and to increment the number of filters if more 
information is needed. This, we call progressive attention. 

3. The training speed of neural networks, which depends on the dimensionality
of the input vector, benefits from the data reduction.

The progressive attention is related to the incremental focus of attention
(IFA) for tracking |2Zj or the attentive processing strategy (GAZE) for face fea- 
ture detection H3|. Both works are inspired by |2S! and relate features to scales 
by using a coarse-to-fine image resolution strategy. In contrast, the progressive
attention should not relate features to scale but to the object itself that is
described by these features. In this sense, the object is considered as a collection
of image features, and the more information about the object is needed to fulfill
a task, the more features are extracted from the image.

In the following section we will give a short introduction to GWNs. Also, we 
will discuss each single point mentioned above, including the invariance proper- 
ties, the abstraction properties and specificity of the wavelet parameters for the 
object representation and a task oriented image filtration. 
In section 3 we will present the results of our pose estimation experiment
where we exploited the optimality of the filter bank and the progressive attention 
property to speed up the response time of the system and to optimize the training 
of the neural network. 

In the last section we will conclude with some final remarks. 



2 Introduction to Gabor Wavelet Networks 



The basic idea of the wavelet networks is first stated by m, and the use of 
Gabor functions is inspired by the fact that they are recognized to be good 
feature detectors PD] ■ 

To define a GWN, we start out, generally speaking, by taking a family of $N$
odd Gabor wavelet functions $\Psi = \{\psi_{n_1}, \ldots, \psi_{n_N}\}$ of the form

$$\psi_n(x, y) = \exp\Big(-\frac{1}{2}\Big[\big(s_x((x - c_x)\cos\theta - (y - c_y)\sin\theta)\big)^2 + \big(s_y((x - c_x)\sin\theta + (y - c_y)\cos\theta)\big)^2\Big]\Big) \times \sin\big(s_x((x - c_x)\cos\theta - (y - c_y)\sin\theta)\big) \qquad (1)$$

with $n = (c_x, c_y, \theta, s_x, s_y)^T$. Here, $c_x, c_y$ denote the translation of the Gabor
wavelet, $s_x, s_y$ denote the dilation and $\theta$ denotes the orientation. The choice
of $N$ is arbitrary and is related to the maximal representation precision of the
network. The parameter vector $n$ (translation, orientation and dilation) of the
wavelets may be chosen arbitrarily at this point. In order to find the GWN for
image $I$, the energy functional



$$E = \min_{n_i, w_i \text{ for all } i} \Big\| I - \sum_i w_i \psi_{n_i} \Big\|_2^2 \qquad (2)$$

is minimized with respect to the weights $w_i$ and the wavelet parameter vectors
$n_i$. Equation (2) says that the $w_i$ and $n_i$ are optimized (i.e. translation, dilation
and orientation of each wavelet are chosen) such that the image $I$ is optimally
approximated by the weighted sum of Gabor wavelets $\psi_{n_i}$. We therefore define
a Gabor wavelet network as follows:



Definition: Let $\psi_{n_i}$, $i = 1, \ldots, N$ be a set of Gabor wavelets, $I$ a DC-free
image and $w_i$ and $n_i$ chosen according to the energy functional (2). The two
vectors
$\Psi = (\psi_{n_1}, \ldots, \psi_{n_N})^T$ and $w = (w_1, \ldots, w_N)^T$
define then the Gabor wavelet network $(\Psi, w)$ for image $I$.
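
A minimal sketch of the odd Gabor wavelet and of the energy that the network minimises may help here. It assumes the usual odd Gabor form (a Gaussian envelope in rotated, scaled coordinates multiplied by the sine of the scaled first coordinate) and evaluates everything on an integer pixel grid; the function names are hypothetical and no optimiser is included.

```python
import math
from typing import List, Sequence, Tuple

Params = Tuple[float, float, float, float, float]  # (cx, cy, theta, sx, sy)


def odd_gabor(x: float, y: float, n: Params) -> float:
    """Odd Gabor wavelet with translation (cx, cy), orientation theta, dilation (sx, sy)."""
    cx, cy, theta, sx, sy = n
    u = sx * ((x - cx) * math.cos(theta) - (y - cy) * math.sin(theta))
    v = sy * ((x - cx) * math.sin(theta) + (y - cy) * math.cos(theta))
    return math.exp(-0.5 * (u * u + v * v)) * math.sin(u)


def reconstruct(weights: Sequence[float], params: Sequence[Params],
                width: int, height: int) -> List[List[float]]:
    """Weighted sum of wavelets sampled at integer pixel positions."""
    return [
        [sum(w * odd_gabor(x, y, n) for w, n in zip(weights, params))
         for x in range(width)]
        for y in range(height)
    ]


def energy(image: Sequence[Sequence[float]], weights: Sequence[float],
           params: Sequence[Params]) -> float:
    """Squared L2 distance between the image and its wavelet approximation."""
    height, width = len(image), len(image[0])
    approx = reconstruct(weights, params, width, height)
    return sum(
        (image[y][x] - approx[y][x]) ** 2
        for y in range(height) for x in range(width)
    )
```

In a full implementation the energy would be minimised jointly over all weights and parameter vectors; here it only measures how well a given network explains an image.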

It should be mentioned that it was proposed before 0 0 E] to use an 
energy functional (2) in order to find the optimal set of weights $w_i$ for a fixed set
Fig. 1. The rightmost image shows the original face image $I$; the other images show
its reconstruction $\hat{I}$ with 16, 52, 116 and 216 Gabor wavelets (left to right).



Fig. 2. The images show a Gabor wavelet network 
with N = 16 wavelets after optimization (left) and
the indicated positions of each single wavelet (right). 

of non-orthogonal wavelets $\psi_{n_i}$. We enhance this approach by also finding the
optimal parameter vectors $n_i$ for each wavelet $\psi_{n_i}$. The parameter vectors are
chosen from continuous phase space K.® jS] and the Gabor wavelets are positioned 
with sub-pixel accuracy. This is precisely the main advantage over the discrete 
approach El El . While in case of a discrete phase space local image structure has 
to be approximated by a combination of wavelets, a single wavelet can be chosen 
selectively in the continuous case to reflect precisely the local image structure. 
This assures that a maximum of the image information is encoded. 

Using the optimal wavelets $\Psi$ and weights $w$ of the Gabor wavelet network
of an image $I$, $I$ can be (closely) reconstructed by a linear combination of the
weighted wavelets:

$$\hat{I} = \sum_{i=1}^{N} w_i \psi_{n_i} .$$

Of course, the quality of the image representation and of the reconstruction
depends on the number $N$ of wavelets used and can be varied to reach almost
any desired precision. The relation between $I$ and $\hat{I}$ is discussed in more
detail below. An example reconstruction can be seen in fig. 1. A family of 216
wavelets has been distributed over the inner face region of the rightmost image
$I$ by the minimization formula (2). Different reconstructions $\hat{I}$, obtained with
the reconstruction formula above for various $N$, are shown in the first four images.

A further example can be seen in fig. 2. The left image shows a reconstruction
with 16 wavelets and the right image indicates the corresponding wavelet
positions. It should be pointed out that at each indicated wavelet position, just
one single wavelet is located.

2.1 Feature Representation with Gabor Wavelets 

It was mentioned in the introduction that the Gabor wavelets are recognized to 
be good feature EHI 1^ detectors, that are directly related to the local image 
features by the optimization function in eq. (2). This means that an optimized
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Fig. 3. The figure shows images of a wooden toy 
block on which a GWN was trained. The black line 
segments sketch the positions, sizes and orientations 
of all the wavelets of the GWN (left), and of some 
automatically selected wavelets (right). 

wavelet has, e.g., ideally the exact position and orientation of a local image
feature. An example can be seen in fig. 3. The figure shows the image of a little
wooden toy block, on which a Gabor wavelet network was trained. The left image
shows the positions, scales and orientations of the wavelets as little black line
segments. By thresholding the weights, the more “important” wavelets may be
selected, which leads to the right image. Ideally, each Gabor wavelet should be
positioned exactly on the image line after optimization. Furthermore, since large
weights indicate that the corresponding wavelets represent an edge segment (see
sec. 2.2), these wavelets encode local geometrical object information. In reality,
however, interactions with other wavelets of the network have to be considered
so that most wavelet parameters reflect the position, scale, and orientation of
the image line closely, but not precisely. This fact is clearly visible in fig. 3. As
can be seen in fig. 1, an object can be represented almost perfectly with a
relatively small set of wavelets. The considerable data reduction is achieved by
the introduction of the model for local image primitives, i.e. the introduction of
Gabor wavelets.

The use of Gabor filters as a model for local object primitives leads to a higher 
level of abstraction where object knowledge is represented by a set of local image 
primitives. The Gabor wavelets in a network that represent edge segments can be 
easily identified. How to identify wavelets, however, that encode specific textures 
is not really clear, yet, and subject to future investigation. 

Other models for local image primitives have been tested such as Gaussian 
and their derivatives, which are often used as radial basis functions in RBF 
networks Q. It is interesting, however, that all other models have proven to be 
much less capable. 







2.2 Direct Calculation of Weights and Distances 



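As mentioned earlier, the weights $w_i$ of a GWN are directly related to the filter
responses of the Gabor filters $\psi_{n_i}$ on the training image.

Gabor wavelet functions are not orthogonal. For a given family $\Psi$ of Gabor
wavelets it is therefore not possible to calculate a weight $w_i$ directly by a simple
projection of the Gabor wavelet $\psi_{n_i}$ onto the image. Instead one has to consider
the family of dual wavelets $\tilde{\Psi} = \{\tilde{\psi}_{n_1}, \ldots, \tilde{\psi}_{n_N}\}$. The wavelet $\tilde{\psi}_{n_j}$ is the dual
wavelet to the wavelet $\psi_{n_i}$ iff $\langle \psi_{n_i}, \tilde{\psi}_{n_j} \rangle = \delta_{ij}$. With $\tilde{\Psi} = (\tilde{\psi}_{n_1}, \ldots, \tilde{\psi}_{n_N})^T$,
we can write $\langle \Psi, \tilde{\Psi} \rangle = \mathbb{1}$. In other words: $w_i = \langle I, \tilde{\psi}_{n_i} \rangle$. We find $\tilde{\psi}_{n_i}$ to be

$$\tilde{\psi}_{n_i} = \sum_j (\Psi^{-1})_{ij}\, \psi_{n_j}, \quad \text{where } \Psi_{ij} = \langle \psi_{n_i}, \psi_{n_j} \rangle .$$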
Fig. 4. An image $I$ from the image space $\mathbb{I}$
is mapped by the linear mapping $\tilde{\Psi}$ onto the
vector $w$ of the vector space $\mathbb{R}^N$
over the basis functions $\psi_{n_i}$. The mapping
of $w$ into $\langle \Psi \rangle \subset \mathbb{I}$ is achieved with the
linear mapping $\Psi$. $\tilde{\Psi}$ can be identified with
the pseudo-inverse of $\Psi$, and the mapping of
$I$ onto $w \in \mathbb{R}^N$, $I\tilde{\Psi} = w$, is an orthogonal
projection.



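The equation $w_i = \langle J, \tilde{\psi}_{n_i} \rangle$ allows us to define the operator

$$\mathcal{T}_\Psi : L^2(\mathbb{R}^2) \longrightarrow \langle \psi_{n_1}, \ldots, \psi_{n_N} \rangle \qquad (3)$$

as follows: Given a set $\Psi$ of optimal wavelets of a GWN, the operator $\mathcal{T}_\Psi$ realizes
an orthogonal projection of a function $J$ onto the closed linear span of $\Psi$ (see
eq. (3) and fig. 4), i.e.

$$\hat{J} = \mathcal{T}_\Psi(J) = \sum_{i=1}^{N} w_i \psi_{n_i}, \quad \text{with } w = J\tilde{\Psi} . \qquad (4)$$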
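
The direct weight calculation can be sketched on discretised signals. Instead of forming the dual wavelets explicitly, the sketch solves the linear system G w = b, where G is the Gram matrix of inner products between the basis functions and b holds the inner products of the signal with each basis function; this is algebraically the same as projecting onto the duals. The helper names and the small Gaussian elimination routine are illustrative assumptions, and signals are plain lists of samples.

```python
from typing import List, Sequence


def inner(f: Sequence[float], g: Sequence[float]) -> float:
    """Discrete inner product of two sampled functions."""
    return sum(a * b for a, b in zip(f, g))


def solve(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def gwn_weights(signal: Sequence[float],
                basis: Sequence[Sequence[float]]) -> List[float]:
    """Weights of the orthogonal projection of signal onto span(basis)."""
    gram = [[inner(pi, pj) for pj in basis] for pi in basis]
    rhs = [inner(signal, pi) for pi in basis]
    return solve(gram, rhs)


def project(signal: Sequence[float],
            basis: Sequence[Sequence[float]]) -> List[float]:
    """Orthogonal projection of signal onto the span of a non-orthogonal basis."""
    w = gwn_weights(signal, basis)
    return [sum(wi * psi[k] for wi, psi in zip(w, basis))
            for k in range(len(signal))]
```

The sketch assumes the basis functions are linearly independent, so that the Gram matrix is invertible.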

The direct calculation of the distance between two families of Gabor wavelets,
$\Psi$ and $\Phi$, can also be established by applying the above to each of the wavelets
$\phi_j \in \Phi$:

$$\mathcal{T}_\Psi(\phi_j) = \sum_i \langle \phi_j, \tilde{\psi}_i \rangle\, \psi_i , \qquad (5)$$

which can be interpreted as the representation of each wavelet $\phi_j$ as a
superposition of the wavelets $\psi_i$. With this, the distance between $\Psi$ and $\Phi$ can be given
directly by






$$\sum_i \big\| \phi_i - \mathcal{T}_\Psi(\phi_i) \big\|^2 + \sum_i \big\| \psi_i - \mathcal{T}_\Phi(\psi_i) \big\|^2 , \qquad (6)$$

where $\| \cdot \|$ is the Euclidean norm. With this distance measurement, the distance
between two object representations can be calculated very efficiently.



2.3 Reparameterization of Gabor Wavelet Networks 

The task of finding the position, the scale and the orientation of a GWN in a
new image is most important because otherwise the filter responses are
meaningless. Here, PCA, bunch graphs and GWN have similar properties: in case
of the PCA and bunch graph it is important to ensure that corresponding pixels
are aligned into a common coordinate system; in case of the GWN, local image
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primitives have to be aligned. For example, consider an image J that shows the 
person of fig. ^ left, possibly distorted affinely. Given a corresponding GWN we 
are interested in finding the correct position, orientation and scaling of the GWN 
so that the wavelets are positioned on the same facial features as in the original 
image, or, in other words, how should the GWN be deformed (warped) so that 
it is aligned with the coordinate system of the new object. An example for a 
successful warping can be seen in fig. 2, where in the right image the wavelet
positions of the original wavelet network are marked, and in fig. 5, where in new
images the wavelet positions of the reparameterized Gabor wavelet network are 
marked. Parameterization of a GWN is established by using a superwavelet m-- 





Fig. 5. The images show the positions of each of the 16 wavelets after reparameterizing 
the wavelet net and the corresponding reconstruction. The reconstructed faces show 
the same orientation, position and size as the ones they were reparameterized on. 

Definition: Let $(\Psi, w)$ be a Gabor wavelet network with $\Psi = (\psi_{n_1}, \ldots, \psi_{n_N})^T$,
$w = (w_1, \ldots, w_N)^T$. A superwavelet $\Psi_n$ is defined to be a linear combination
of the wavelets $\psi_{n_i}$ such that

$$\Psi_n(x) = \sum_i w_i\, \psi_{n_i}(SR(x - c)) , \qquad (7)$$

where the parameters of vector $n$ of superwavelet $\Psi_n$ define the dilation matrix
$S = \mathrm{diag}(s_x, s_y)$, the rotation matrix $R$, and the translation vector
$c = (c_x, c_y)^T$.

A superwavelet $\Psi_n$ is again a wavelet (because of the linearity of the sum) and
in particular a continuous function that has the wavelet parameters dilation,
translation and rotation. Therefore, we can handle it in the same way as we
handled each single wavelet in the previous section. For a new image $J$ we may
arbitrarily deform the superwavelet by optimizing its parameters $n$ with respect
to the energy functional $E$:



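$$E = \min_n \| J - \Psi_n \|_2^2 . \qquad (8)$$

Equation (8) defines the operator

$$\mathcal{P}_\Psi : L^2(\mathbb{R}^2) \longrightarrow \mathbb{R}^5, \qquad J \longmapsto n = (c_x, c_y, \theta, s_x, s_y) , \qquad (9)$$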
where $n$ minimizes the energy functional $E$ of eq. (8). In eq. (9), $\Psi$ is defined to
be a superwavelet. For optimization of the superwavelet parameters, the same
optimization procedure as for eq. (2) may be used. An example of an optimization
process can be seen in fig. 6. Shown are the initial values of $n$, the values after
2 and 4 optimization cycles of the gradient descent method and the final values
after 8 cycles, each marked with the white square. The square marks the inner
face region and its center position marks the center position of the corresponding
superwavelet. The superwavelet used in fig. El is the one of fig. El i.e. it is derived 
from the person in fig. E 




Fig. 6. The images show the 1st, the 2nd, the 4th and the 8th (final) step of the gradient
descent method optimizing the parameters of a superwavelet. The top left image shows 
the initial values with 10 px. off from the true position, rotated by 10° and scaled by 
20%. The bottom right image shows the final result. As superwavelet, the GWN of 
figure Q was used. 

The image distortions of a planar object that is viewed under orthographic
projection are described by six parameters: translation $c_x, c_y$, rotation $\theta$, and
dilation $s_x$, $s_y$ and $s_{xy}$. The degrees of freedom of a wavelet only allow translation,
dilation and rotation. However, it is straightforward to include also shearing
and thus allow any affine deformation of $\Psi_n$. For this, we enhance the parameter
vector $n$ to a six-dimensional vector $n = (c_x, c_y, \theta, s_x, s_y, s_{xy})^T$. By rewriting the
scaling matrix $S$,

$$S = \begin{pmatrix} s_x & s_{xy} \\ 0 & s_y \end{pmatrix} ,$$

we are now able to deform the superwavelet affinely.
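
The affine deformation of a superwavelet can be sketched as follows. The sketch assumes an upper-triangular scaling matrix S with rows (s_x, s_xy) and (0, s_y), and a rotation matrix R with the same sign convention as the rotated coordinates of the single Gabor wavelet; both are assumptions made for illustration, as are the function names.

```python
import math
from typing import Sequence, Tuple

Params = Tuple[float, float, float, float, float]          # (cx, cy, theta, sx, sy)
SuperParams = Tuple[float, float, float, float, float, float]  # adds sxy


def odd_gabor(x: float, y: float, n: Params) -> float:
    """Odd Gabor wavelet in the assumed standard form."""
    cx, cy, theta, sx, sy = n
    u = sx * ((x - cx) * math.cos(theta) - (y - cy) * math.sin(theta))
    v = sy * ((x - cx) * math.sin(theta) + (y - cy) * math.cos(theta))
    return math.exp(-0.5 * (u * u + v * v)) * math.sin(u)


def superwavelet(x: float, y: float, weights: Sequence[float],
                 params: Sequence[Params], n: SuperParams) -> float:
    """Evaluate sum_i w_i * psi_i(S R (x - c)) for the six-parameter vector n."""
    cx, cy, theta, sx, sy, sxy = n
    dx, dy = x - cx, y - cy
    # Rotation R applied to (x - c).
    rx = dx * math.cos(theta) - dy * math.sin(theta)
    ry = dx * math.sin(theta) + dy * math.cos(theta)
    # Assumed scaling matrix with shear: S = [[sx, sxy], [0, sy]].
    u = sx * rx + sxy * ry
    v = sy * ry
    return sum(w * odd_gabor(u, v, p) for w, p in zip(weights, params))
```

With n = (0, 0, 0, 1, 1, 0) the superwavelet reduces to the plain weighted sum of the network's wavelets, which is a convenient starting point for the optimisation of n on a new image.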

The reparameterization (warping) is quite robust: using the superwavelet
of fig. □ we have found in several experiments on the various subjects with 
~ 60 pixels in width that the initialization of $n_0$ may vary from the correct
parameters by approx. ±10 px. in x and y direction, by approx. 20% in scale
and by approx. ±10° in rotation (see fig. 6). Compared to the AAM, these
findings indicate a much better robustness |S|. Furthermore, we found that the 
warping algorithm converged in 100% of the cases to the correct values when 
applied on the same individual, independently of pose and gesture. The tests 
were done on the images of the Yale face database m and on our own images. 
The poses were varied within the range of ~ ±20° in pan and tilt where all face 
features were still visible. The various gestures included normal, happy, sad, 
surprised, sleepy, glasses, wink. The warping on other faces depended certainly 
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on the similarity between the training person and the test person and on the 
number of used wavelets. We found that the warping algorithm always converged
correctly on ≈ 80% of the test persons (including the training person) of the Yale
face database. The warping algorithm has also been successfully applied for a 
wavelet based affine real-time face tracking application m- 

2.4 Related Work 

There are other models for image interpretation and object representation. Most 
of them are based on PC A H, such as the eigenface approach . The eigenface 
approach has shown its advantages especially in the context of face recognition.
Its major drawbacks are its sensitivity to perspective deformations and to il- 
lumination changes. PCA encodes textural information only, while geometrical 
information is discarded. Furthermore, the alignment of face images into a com- 
mon coordinate system is still a problem. 

Another PCA based approach is the active appearance model ( AAM) jSI . This 
approach enhances the eigenface approach considerably by including geometrical 
information. This allows an alignment of image data into a common coordinate 
system while the formulation of the alignment technique can be elegantly done 
with techniques of the AAM framework. Also, recognition and tracking applica- 
tions are presented within this framework m- An advantage of this approach 
was demonstrated in [5]: they showed the ability of the AAM to model, in a
photo-realistic way, almost any face gesture and gender. However, this is
undoubtedly an expensive task and one might ask for which task such a precision is
really needed. In fact, a variation to different precision levels in order to spare
computational resources and to restrict considerations to the data actually needed
for a certain application seems not easily possible.

The bunch graph approach ini is based, on the other hand, on the discrete 
wavelet transform. A set of Gabor wavelets are applied at a set of hand selected 
prominent object points, so that each point is represented by a set of filter 
responses, called jet. An object is then represented by a set of jets, that encode 
each a single local texture patch of the object. The jet topology, the so-called 
image graph, encodes geometrical object information. A precise positioning of 
the image graph onto the test image is important for good matching results and 
the positioning is quite a slow process. The feature detection capabilities of the 
Gabor filters are not exploited since their parameters are fixed and a variation 
to different precision levels has not been considered so far. 

The bunch graph approach is inspired by the discrete wavelet transform, 
where, in contrast to the continuous wavelet transform, the phase space is discrete.
The problem of how to sample the phase space is a major problem in
this context and is widely studied 0 H 13 [Q • In general, the dis- 

cretization scheme depends on the selected wavelet function. It was studied by 
Lee in HU how dense the phase space has to be sampled in order to achieve 
a lossless wavelet representation of an image when a Gabor function (which is 
non-orthonormal) is used as a wavelet. He found out that for each discrete pixel 
position one needs eight equidistant orientation samples and five equidistant 
Fig. 7. The left image shows the original doll
face image $I$, the right image shows its
reconstruction $\hat{I}$, obtained with the reconstruction
formula and an optimal wavelet net $\Psi$ of just N = 52 odd Gabor
wavelets, distributed over the inner face region.
For optimization, the scheme that was
introduced in section 2 was applied.

scale samples. One sees that this justifies the choice of 40 Gabor filters as given 
in m- However, one also sees that a image representation with 40 wavelets per 
pixel is a highly redundant representation and only applicable if reduced to a 
small set of feature points in the image, as done in nq. With this, a usual bunch 
graph representation contains about 20 jets for each object with altogether 800 
complex coefficients. The reason for this highly redundant representation PH 
is that the set of filters is static, and not dynamic, as in the GWN. Alterna- 
tively, one may model the local image structure directly by explicitly selecting 
the correct Gabor wavelet parameters of the continuous phase space. This is 
the underlying idea of the Gabor wavelet networks. With this, as shown above,
as few as 52 Gabor wavelets are sufficient for a good representation of a facial
image while already 216 Gabor wavelets reach almost perfect quality.
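
The redundancy argument above can be made concrete with a short calculation. The counts of orientations, scales, jets and wavelets are taken from the text; counting each complex coefficient as two real numbers and each network wavelet as five parameters plus one weight is an assumption made only for this comparison.

```python
# Figures quoted in the text.
ORIENTATIONS = 8
SCALES = 5
JETS_PER_OBJECT = 20
GWN_WAVELETS = 52


def bunch_graph_complex_coefficients() -> int:
    """Filters per jet (orientations x scales) times the number of jets."""
    return ORIENTATIONS * SCALES * JETS_PER_OBJECT


def bunch_graph_real_numbers() -> int:
    """Assumption: one complex coefficient is stored as two real numbers."""
    return 2 * bunch_graph_complex_coefficients()


def gwn_real_numbers(n_wavelets: int = GWN_WAVELETS) -> int:
    """Assumption: each wavelet stores (cx, cy, theta, sx, sy) and one weight."""
    return n_wavelets * (5 + 1)
```

Under these assumptions the bunch graph stores 800 complex coefficients, i.e. 1600 real numbers, while a network of 52 wavelets stores 312 real numbers.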

3 Pose Estimation with GWN 

In this section we will present the approach for the estimation of the pose of a 
head. There exist many different approaches for pose estimation, including pose 
estimation with color blobs HEIIS!, pose estimation applying a geometrical 
approach in, stereo information or neural networks to cite just a few. 
While in some approaches, such as in only an approximate pose is es- 

timated, other approaches have the goal to be very precise so that they could 
even be used as a basis for gaze detection such as in m- The precision of the 
geometrical approach m was extensively tested and verified in m- The mini- 
mal mean pan/tilt error that was reached was > 1.6°. In comparison to this, the 
neural network approach in [3 reached a minimal pan/tilt error of > 0.64°. 

The good result in 0 was reached by first detecting the head using a color 
tracking approach. Within the detected color blob region, 4x4 sets of 4 complex 
Gabor filters with the different orientations of $0$, $\pi/4$, $\pi/2$ and $3\pi/4$ were evenly 
distributed. The 128 coefficients of these 64 complex projections of the Gabor 
filters were then fed into a neural LLM network. 

At this point, it is reasonable to assume that a precise positioning of the 
Gabor filters would result in an even lower mean pan/tilt error. In our 
experiments we therefore trained a GWN on an image I showing a doll’s head. For 
the training of the GWN we used again the optimization scheme introduced in 
section 0with N = 52 Gabor wavelets (see fig. 0). In order to be comparable 
with the approach in [2] we used in our experiments exactly the same neural 
network and the same number of training examples as described in [2]. A 
subspace variant of the Local Linear Map (LLM) pi] was used for learning 
input-output mappings 0- The LLM rests on a locally linear (first order) 
approximation of the unknown function $f : \mathbb{R}^n \to \mathbb{R}^k$ and 
computes its output as (winner-take-all variant) 
$y(x) = A_{bmu}(x - c_{bmu}) + o_{bmu}$. Here, $o_{bmu} \in \mathbb{R}^k$ is an 
output vector attached to the best matching unit (zero order approximation) 
and $A_{bmu} \in \mathbb{R}^{k \times n}$ is a local estimate of the Jacobian 
matrix (first order term). 

Centers are distributed by a clustering algorithm. Due to the first order term, 
the method is very sensitive to noise in the input. With a noisy version 
$x' = x + \eta$ the output differs by $A_{bmu}\eta$, and the LLM largely 
benefits from projecting to the local subspace, canceling the noise component 
of $\eta$ orthogonal to the input manifold M. Normalized Gaussians were used as 
basis functions. 
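
The winner-take-all rule above can be sketched in a few lines. This is only an 
illustration under stated assumptions: the best matching unit is chosen here as 
the center nearest to the input in Euclidean distance, and the subspace 
projection and the normalized Gaussian basis functions mentioned in the text are 
left out. The names `llm_output`, `centers`, `jacobians` and `offsets` are 
hypothetical. 

```python
def llm_output(x, centers, jacobians, offsets):
    """Winner-take-all LLM output y(x) = A_bmu (x - c_bmu) + o_bmu.

    centers[i]   : center c_i (list of n floats)
    jacobians[i] : local Jacobian estimate A_i (k rows of n floats)
    offsets[i]   : output vector o_i (list of k floats)
    """
    def squared_distance(i):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, centers[i]))

    # Assumption: the best matching unit is the nearest center.
    bmu = min(range(len(centers)), key=squared_distance)
    diff = [xi - ci for xi, ci in zip(x, centers[bmu])]
    # First order term A_bmu (x - c_bmu) plus zero order term o_bmu.
    return [
        sum(a * d for a, d in zip(row, diff)) + o
        for row, o in zip(jacobians[bmu], offsets[bmu])
    ]


# Two units mapping 2-D inputs to 1-D outputs.
centers = [[0.0, 0.0], [10.0, 10.0]]
jacobians = [[[1.0, 0.0]], [[0.0, 2.0]]]
offsets = [[5.0], [-1.0]]
y_near_first = llm_output([1.0, 0.0], centers, jacobians, offsets)
y_near_second = llm_output([9.0, 11.0], centers, jacobians, offsets)
```

Because the Jacobian multiplies the input difference directly, any noise added 
to `x` is passed to the output through the same matrix, which is the 
sensitivity the paragraph describes. 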

The doll’s head was connected to a robot arm, so that the pan/tilt ground 
truth was known. During the training and testing, the doll’s head was first 
tracked using our wavelet based face tracker PI- For each frame we proceeded 
in two steps: 

1. optimal reparameterization of the GWN by using the positioning operator V 

2. calculating the optimal weights for the optimally repositioned GWN by using 
the projection operator T. 

Fig. 8. The images show different orientations of the doll’s head. The head is connected 
to a robot arm so that the ground truth is known. The white square indicates the 
detected position, scale and orientation of the GWN. 

See fig. 8 for example images. The weight vector that was calculated with the 
operator T was then fed into the same neural network that was used in [2]. 
The training was done exactly as described in [2]: We used 400 training 
images, evenly distributed within the range of ±20° in pan and tilt direction 
(this is the range where all face features appeared to be visible). With this, we 
reached a minimal mean pan/tilt error of 0.19° for a GWN with 52 wavelets 
and a minimal mean pan/tilt error of 0.29° for a GWN with 16 wavelets. The 
maximal errors were 0.46° for 52 wavelets and 0.81° for 16 wavelets, respectively. 
The experiments were carried out on an experimental setup, that has not yet 
been integrated into a complete, single system. A complete system should reach 
a speed on a 450 MHz Linux Pentium of >~ 5 fps for the 52 wavelet network 
and >~ 10 fps for the 16 wavelet network 0. 

In comparison, for the gaze detection in PH, 625 training images were used, 
with a 14-D input vector, to train an LLM network. The user was advised to 
fixate a 5 × 5 grid on the computer screen. The minimal errors after training for 
pan and tilt were 1.5° and 2.5°, respectively, while the system speed was 1 Hz on 
an SGI (Indigo, High Impact). A direct comparison to geometrical approaches is 
difficult because, by their very nature, the cited ones are less precise and 
less robust, but much faster. 

^ This is a conservative estimate; various optimizations should allow higher 
frame rates. 



4 Conclusions 

The contribution of this article is twofold: 

1. We introduced the concepts of the Gabor wavelet network and the Gabor 
superwavelet that allow a data reduction and the use of the progressive at- 
tention approach: 

— The representation of an object with variable degree of precision, from 
a coarse representation to an almost photo-realistic one, 

— the definition of an optimal set of filters for a selective filtering 

— the representation of object information on a basis of local image prim- 
itives and 

— the possibility for affine deformations to cope with perspective deforma- 
tions. 

In the second section we discussed these various properties in detail. In uni 
ITC] . GWNs have already been used successfully for wavelet based affine real 
time face tracking and pose invariant face recognition. It is future work to 
fully exploit the advantages of the data reduction by reducing considerations 
to the vector space over the set of Gabor wavelet networks. 

2. We exploited all these advantages of the GWN for the estimation of the head 
pose. The experimental results show quite impressively that it is sensible for 
an object representation to reflect the specific individual properties of the 
object rather than being independent of the individual properties such as 
general representations are. This can especially be seen when comparing 
the presented approach with the one in [2]: While having used the same 
experimental setup and the same type of neural network, the precision of the 
presented approach is twice as good with only 16 coefficients (vs. 128), and 
three times as good with only about half the coefficients. Furthermore, the 
experiment shows how the precision in pose estimation and the system speed 
change with an increasing number of filters. A controllable variability of 
precision and speed has a major advantage: The system is able to decide how 
precise the estimation should be in order to minimize the probability that 
the given task is not fulfilled satisfactorily. It is future work to incorporate 
the experimental setup into a complete system. An enhancement for the 
evaluation of the positions of the irises for a precise estimation of gaze is 
about to be tested. 
Abstract. Consider scenes deteriorated by reflections off a semi- 
reflecting medium (e.g., a glass window) that lies between the observer 
and an object. We present two approaches to recover the superimposed 
scenes. The first one is based on a focus cue, and can be generalized 
to volumetric imaging with multiple layers. The second method, based 
on a polarization cue, can automatically label the reconstructed scenes 
as reflected/transmitted. It is also demonstrated how to blindly deter- 
mine the imaging PSF or the orientation of the invisible (semi-reflecting) 
surface in space in such situations. 



1 Introduction 



This work deals with the situation in which the projection of the scene on the 
image plane is multi-valued due to the superposition of several contributions. 
This situation is encountered while looking through a window, where we see 
both the outside world (termed real object j1 211 ) . and a semi-reflection of the 

objects inside, termed virtual objects. It is also encountered in microscopy and 
tomography, where viewing a transparent object slice is disturbed by the 
superposition of adjacent defocused slices. Our goal is to clear the disturbing 
crosstalk of the superimposing contributions, and gain information on the scene 
structure. 

Previous treatment of this situation has been based mainly on motion PEE 
II 3l23f2dj . and stereo 022|. Polarization cues have been used for such scenarios 
in Refs. j8l 1 2| . Thorough polarization analysis, that enabled the labeling of the 
layers was done in This paper refers to these methods, and deals 

also with the use of depth of field (DOF) to analyze images. DOF has been 
utilized for analysis of transparent layers mainly in microscopy [1I5I6I11] . but 
mainly in cases of opaque (and occluding) layers, as in f2l1 D) . 
Following m we show that the methods that rely on motion and stereo are 
closely related to approaches based on DOF. Then, we show how to recover the 
transparent layers using a focus cue. In the case of semireflections, we follow 
Refs. 1 1 I tiHOj to show that two raw images are adequate to recover the layers. 
The recovery is done in conjunction to a simple way to estimate the transfer 
function between the images, based on the raw images, yielding the optimal 
layer separation. We then show how to label each layer as reflected or transmitted 
using a polarization cue, which also indicates the orientation of the invisible semi- 
reflecting window in space. Following [II YIIiSIIDmi] . our use of the polarization 
cue is effective also away from the Brewster angle. 

2 Distance Cues 

2.1 Defocus vs. Stereo or Motion 

Depth cues are usually very important for the separation of transparent layers. In 
microscopy, each superimposing layer is at a different distance from the objective. 
Hence when one layer is focused (at a certain image slice) the others are defocus 
blurred, and this serves as the handle for removal of the inter-layer crosstalk. In 
case of semi-reflected scenes, the real object is unrelated to the virtual object. 
So, it is reasonable that also in this case the distance of each layer from the 
imaging system is different. Therefore, if the scene is viewed by a stereo system, 
each layer will have a different disparity; if the camera is moved, each layer will 
have different motion parameters; and again, if one layer is focused, the other is 
defocused. 

Note that depth from defocus blur or focus are realizations of triangulation 
just as depth from stereo or motion are realizations of this principle [14]. 
Consider Fig. 1, where a stereo (or depth from motion) system is viewing the same 
scene as a monocular wide aperture camera of the same physical dimensions: 
the stereo baseline D is the same as the lens aperture, the distance from 
the lens to the sensor, v, is the same, and the stereo system is fixated on a 
point at the same distance at which the wide-aperture system is focused. Then, 
the disparity is equal to the defocus blur diameter, under the geometric optics 
approximation 0 
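
The equality of disparity and blur diameter can be checked numerically. The 
following is a minimal sketch under the thin-lens and small-angle (paraxial) 
assumptions; the function names and the numerical values are illustrative and 
do not come from the original work. The blur diameter is computed from the lens 
law, while the disparity is computed from similar triangles for two pinholes 
separated by the baseline D. 

```python
def defocus_blur_diameter(aperture, sensor_dist, focus_dist, object_dist):
    """Blur-circle diameter of a thin lens under geometric optics.

    The lens is focused on focus_dist with the sensor at sensor_dist (v).
    """
    inv_focal = 1.0 / focus_dist + 1.0 / sensor_dist   # lens law, focused plane
    image_dist = 1.0 / (inv_focal - 1.0 / object_dist)  # where the point focuses
    # Similar triangles between the aperture and the converging ray cone.
    return aperture * abs(image_dist - sensor_dist) / image_dist


def stereo_disparity(baseline, sensor_dist, fixation_dist, object_dist):
    """Disparity of a point relative to the fixation point (paraxial)."""
    # Each pinhole shifts the projection by (baseline / 2) * v / distance,
    # so the pair sees a relative shift of baseline * v / distance.
    return baseline * sensor_dist * abs(1.0 / object_dist - 1.0 / fixation_dist)


# Same physical dimensions for both systems (all lengths in meters).
D, v, focused_at, point_at = 0.02, 0.05, 2.0, 1.0
blur = defocus_blur_diameter(D, v, focused_at, point_at)
disparity = stereo_disparity(D, v, focused_at, point_at)
```

Both quantities reduce to $D v \, |1/u - 1/u_f|$ for an object at distance $u$ 
and a focus (or fixation) distance $u_f$, which is why the two values agree. 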

Therefore, algorithms developed based on defocus blur can be applied to 
approaches based on stereo or motion, and vice-versa. Besides, these approaches 
will have similar fundamental limitations. Particularly for the scenarios treated 
in this work, seeking a separation of two transparent layers out of two raw images, 
in each of which the focus is different, is equivalent to separating the layers 
using two raw images in which the disparities are different. Blind estimation 
of the defocus blur kernels (or the transfer functions of the imaging system) is 
equivalent to seeking the motion parameters between the two stereo raw images. 

In this section we treat the transparent-layers problem using methods that 
rely on defocus blur. However, due to the close relationship of defocus to stereo 
and motion, the reader may generally interchange the “defocus parameters” 
with “motion parameters” and extend the conclusions to classical triangulation 
approaches. 
Fig. 1. [Left] The image of a defocused object point at a certain distance is a blur 
circle of diameter d. [Middle] Its image becomes two points separated by d, if the 
lens is blocked except for two pinholes at opposite sides on the lens perimeter. [Right] 
The disparity equals the same d in a stereo/motion system having the same physical 
dimensions and fixating on the point at the same distance as the point focused by the 
system on the left. 



2.2 Recovery from Known Transfer Functions 

Consider a two-layered scene. We acquire two images, such that in each image 
one of the layers is in focus. Assume for the moment that we also have an estimate 
of the blur kernel operating on each layer, when the camera is focused on the 
other one. Let layer $f_1$ be superimposed on layer $f_2$. We consider only the slices 
$g_a$ and $g_b$, in which either layer $f_1$ or layer $f_2$, respectively, is in focus. The other 
layer is blurred. Modeling the blur as convolution with blur kernels, 

$$g_a = f_1 + f_2 * h_{2a}, \qquad g_b = f_2 + f_1 * h_{1b} . \qquad (1)$$

In the frequency domain Eqs. (1) take the form 

$$G_a = F_1 + H_{2a} F_2, \qquad G_b = F_2 + H_{1b} F_1 . \qquad (2)$$

The naive inverse filtering solution of the problem is 

$$\hat{F}_1 = B (G_a - G_b H_{2a}), \qquad \hat{F}_2 = B (G_b - G_a H_{1b}) , \qquad (3)$$

where 

$$B = (1 - H_{1b} H_{2a})^{-1} . \qquad (4)$$

As the frequency decreases, $H_{2a} H_{1b} \to 1$, and then $B \to \infty$, hence the solution 
is unstable. Moreover, due to energy conservation, the average gray level (DC) 
is not affected by defocusing ($H_{2a} H_{1b} = 1$), hence its recovery is ill posed. 
However, the problem is well posed and stable at the high frequencies. This behavior 
also exists in separation methods that rely on motion USES, as expected from 
the discussion in section 2.1. Note that this is quite opposite to typical 
reconstruction problems, in which instability and noise amplification appear in the 
high frequencies. 

If $H_{2a} H_{1b} \neq 1$ (that is, except at the DC), $B$ can be approximated by the 
series 

$$B(m) = \sum_{k=1}^{m} (H_{1b} H_{2a})^{k-1} . \qquad (5)$$

The approximate solutions $\hat{F}_1(m)$, $\hat{F}_2(m)$ are thus parameterized by $m$, which 
controls how close the filter $B(m)$ is to the inverse filter, and is analogous to 
regularization parameters in typical inversion methods. We define the basic solution 
as the result of using $m = 1$. 
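
Equations (3) and (5) act independently on each spatial frequency, so they can 
be sketched on plain lists of per-frequency values. The sketch below assumes 
that the transforms $G_a$, $G_b$ and the transfer functions $H_{1b}$, $H_{2a}$ 
are already available as equally long sequences; computing them from images 
(for example with an FFT) is outside its scope, and the function names are 
hypothetical. 

```python
def series_filter(h1b, h2a, m):
    """B(m) of Eq. (5) at a single frequency."""
    x = h1b * h2a
    return sum(x ** (k - 1) for k in range(1, m + 1))


def separate_two_slices(Ga, Gb, H1b, H2a, m=1):
    """Approximate layer spectra F1(m), F2(m) following Eq. (3).

    All arguments except m are sequences holding one value per frequency.
    m = 1 gives the basic solution, larger m approaches the inverse filter.
    """
    F1, F2 = [], []
    for ga, gb, h1b, h2a in zip(Ga, Gb, H1b, H2a):
        b = series_filter(h1b, h2a, m)
        F1.append(b * (ga - gb * h2a))
        F2.append(b * (gb - ga * h1b))
    return F1, F2


# Synthetic check away from the DC, where the product H1b * H2a is below 1.
true_F1, true_F2 = [1.0, 2.0], [0.5, -1.0]
H1b, H2a = [0.5, 0.2], [0.4, 0.1]
Ga = [f1 + h * f2 for f1, f2, h in zip(true_F1, true_F2, H2a)]
Gb = [f2 + h * f1 for f1, f2, h in zip(true_F1, true_F2, H1b)]
basic_F1, basic_F2 = separate_two_slices(Ga, Gb, H1b, H2a, m=1)
full_F1, full_F2 = separate_two_slices(Ga, Gb, H1b, H2a, m=60)
```

With $m = 1$ each recovered component equals the true one scaled by 
$1 - H_{1b} H_{2a}$, so the crosstalk is removed but the low frequencies stay 
attenuated; increasing $m$ restores them at the cost of a larger gain. 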

Another approach to layer separation is based on using as input a pinhole 
image and a focused slice, rather than two focused slices. Acquiring one image 
via a very small aperture ( “pinhole camera” ) leads to a simpler algorithm, since 
just a single slice with one of the layers in focus is needed. The advantage is 
that the two images are taken without changing the axial positions of the sys- 
tem components, hence no geometric distortions arise. The “pinhole” image is 
described by 

$$g_0 = (f_1 + f_2)/a , \qquad (6)$$

where $1/a$ is the attenuation of the intensity due to contraction of the aperture. 
This image is used in conjunction with one of the focused slices of Eq. (1), for 
example $g_a$. The inverse filtering solution is 

$$\hat{F}_1 = S (G_a - a G_0 H_{2a}), \qquad \hat{F}_2 = S (a G_0 - G_a) , \qquad (7)$$

where 

$$S = (1 - H_{2a})^{-1} . \qquad (8)$$

Also in this method the filter $S$ can be approximated by 

$$S(m) = \sum_{k=1}^{m} H_{2a}^{k-1} . \qquad (9)$$
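
The pinhole variant of Eqs. (7) and (9) can be sketched in the same 
per-frequency form. As before, the spectra and the transfer function are 
assumed to be given as equally long sequences, and the function names are 
hypothetical. 

```python
def pinhole_series_filter(h2a, m):
    """S(m) of Eq. (9) at a single frequency."""
    return sum(h2a ** (k - 1) for k in range(1, m + 1))


def separate_with_pinhole(Ga, G0, H2a, a, m=1):
    """Approximate layer spectra from a focused slice and a pinhole image.

    Ga  : spectrum of the slice in which layer 1 is focused
    G0  : spectrum of the pinhole image, Eq. (6)
    H2a : transfer function blurring layer 2 in slice a
    a   : intensity ratio between the wide-aperture and pinhole states
    """
    F1, F2 = [], []
    for ga, g0, h2a in zip(Ga, G0, H2a):
        s = pinhole_series_filter(h2a, m)
        F1.append(s * (ga - a * g0 * h2a))
        F2.append(s * (a * g0 - ga))
    return F1, F2


# Synthetic check away from the DC.
true_F1, true_F2 = [1.0, 2.0], [0.5, -1.0]
H2a, a = [0.4, 0.1], 16.0
Ga = [f1 + h * f2 for f1, f2, h in zip(true_F1, true_F2, H2a)]
G0 = [(f1 + f2) / a for f1, f2 in zip(true_F1, true_F2)]
rec_F1, rec_F2 = separate_with_pinhole(Ga, G0, H2a, a, m=80)
```

Only one transfer function enters this version, which is why the blind 
estimation problem for it has fewer unknowns. 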



2.3 Blind Estimation of the Transfer Functions 

The imaging PSFs (and thus their corresponding transfer functions) may be 
different than the ones we estimate and use in the reconstruction algorithm. As 
shown in j1 an error in the PSF leads to contamination of the recovered 
layer by its complementary. The larger B is, the stronger is the amplification 
of this disturbance. Note that $B(m)$ monotonically increases with $m$, within 
the support of the blur transfer function if $H_{1b} H_{2a} > 0$, as is the case when 
the recovery PSFs are Gaussians. Thus, we may expect that the best sense of 
separation will be achieved in the basic solution, even though the low frequencies 
are less attenuated and better balanced with the high frequencies at higher m’s. 

We wish to achieve self-calibration, i.e., to estimate the kernels out of the 
images themselves. This enables blind separation and restoration of the layers. 
Thus, we need a criterion for layer separation. It is reasonable to assume that 
the statistical dependence of the real and virtual layers is small since they usu- 
ally originate from unrelated scenes. Mutual information measures how far the 
images are from statistical independence m- We thus assume that if the layers 
are correctly separated, each of their estimates contains minimum information 
about the other. Mutual information was suggested and used as a criterion for 
alignment in im, where its maximum was sought. We use this measure to 
look for the highest discrepancy between images, thus minimizing it. To decrease 
the dependence of the estimated mutual information on the dynamic range and 
brightness of the individual layers, it was normalized by the mean entropy of 
the estimated layers, when treated as individual images. This measure, denoted 
$I_n$, indicates the ratio of mutual information to the self information of a layer. 
Additional details are given in ITCETI . 
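
A minimal sketch of such a normalized measure is given next. It assumes that 
both layers are given as equally long sequences of already quantized gray 
levels and that probabilities are estimated from simple histograms; the 
binning and estimation details of the original implementation are not 
specified here, so these choices are assumptions. 

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy in bits of a histogram given as raw counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts)


def normalized_mutual_information(layer_a, layer_b):
    """Mutual information divided by the mean entropy of the two layers."""
    if len(layer_a) != len(layer_b) or not layer_a:
        raise ValueError("layers must be non-empty and of equal length")
    n = len(layer_a)
    h_a = entropy(Counter(layer_a).values(), n)
    h_b = entropy(Counter(layer_b).values(), n)
    h_joint = entropy(Counter(zip(layer_a, layer_b)).values(), n)
    mutual = h_a + h_b - h_joint
    mean_entropy = 0.5 * (h_a + h_b)
    return mutual / mean_entropy if mean_entropy > 0 else 0.0


independent = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
identical = normalized_mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
```

The value is 0 for statistically independent layers and 1 when one layer fully 
determines the other, so lower values indicate better separation. 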

The recovered layers depend on the kernels used. Therefore, seeking the ker- 
nels can be stated as a minimization problem: 

$$[\hat{h}_{1b}, \hat{h}_{2a}] = \arg\min_{h_{1b}, h_{2a}} I_n(\hat{f}_1, \hat{f}_2) . \qquad (10)$$

As noted above, errors in the kernels lead to crosstalk (contamination) of the 
estimated layers, which is expected to increase their mutual information. To 
simplify the problem, the kernels can be assumed to be Gaussians. Then, the 
kernels are parameterized only by their standard deviations (proportional to the 
blur radii). The blurring along the sensor raster rows may be different than the 
blurring along the columns. So, generally we assigned a different blur “radius” 
to each axis. If two slices are used, there are two kernels, and the optimization 
is done over a total of four parameters. When a single focused slice is used in 
conjunction with a “pinhole” image, the problem is much simpler. We need to 
determine the parameters of only one kernel, and the factor $a$. The factor $a$ 
can be estimated from the ratio of the f-numbers of the camera in the two 
states, or from the ratio of the average values of the images. 
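
Both ways of estimating $a$ can be sketched briefly. The first function assumes 
that the collected light scales with the inverse square of the f-number, so 
that $a$ is the squared ratio of the two f-numbers; the text itself only 
mentions "the ratio of the f-numbers", so the squaring is an assumption made 
here. The second function uses the fact that defocus does not change the 
average gray level. Function and argument names are hypothetical. 

```python
def attenuation_from_f_numbers(f_number_wide, f_number_pinhole):
    """Estimate a, assuming image irradiance is proportional to 1 / N**2."""
    return (f_number_pinhole / f_number_wide) ** 2


def attenuation_from_means(focused_slice, pinhole_image):
    """Estimate a as the ratio of the average values of the two images."""
    mean_focused = sum(focused_slice) / len(focused_slice)
    mean_pinhole = sum(pinhole_image) / len(pinhole_image)
    return mean_focused / mean_pinhole


a_optics = attenuation_from_f_numbers(2.0, 8.0)
a_images = attenuation_from_means([8.0, 16.0], [0.5, 1.0])
```

In practice the image-based estimate also absorbs exposure or gain differences 
between the two acquisitions, which the f-number ratio alone does not capture. 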



2.4 Recovery Experiments 

A print of the “Portrait of Doctor Gachet” (by van Gogh) was positioned closely 
behind a glass window. The window partly reflected a more distant picture, a 
part of a print of the “Parasol” (by Goya). The cross correlation between the 
raw (focused) images is 0.98. The normalized mutual information is $I_n \approx 0.5$, 
indicating that significant separation is achieved by the focusing process, but 
that substantial crosstalk remains. The basic solution ($m = 1$) based on the 
optimal parameters is shown at the middle row of Fig. 2. It has $I_n \approx 0.006$ (two 
orders of magnitude better than the raw images). Using $m = 5$ yields a better 
balance between the low and high frequency components, but $I_n$ increased to 
about 0.02. As noted above, the theory 1 1 til21)j indicates that an error in the PSF 
Fig. 2. [Top]: The slices in which either of the transparent layers is focused. [Middle 
row]: The basic solution (m = 1). [Bottom row]: Recovery with m = 5. 
Fig. 3. [Top]: Raw images: the far layer is focused when viewed with a wide aperture, 
and with a “pinhole” setup. [Bottom]: The basic recovery. 



model yields a stronger crosstalk for larger $m$. Hence this increase in $I_n$ may 
originate from the inaccuracy of our assumption of a Gaussian model. 

In another example, the scene consisted of a print of the “Portrait of Armand 
Roulin” as the close layer and a print of a part of the “Miracle of San Antonio” 
as the far layer. Here we used a fixed focus setting, and changed the aperture 
between image acquisitions. The slice in which the far layer is focused (using the 
wide aperture) is at the top-left of Fig. 3 and the “pinhole” image is to its right. 
The basic solution based on the optimal parameters is shown on the bottom of 
Fig. 3. The “Portrait” is revealed. 

3 Labeling by a Polarization Cue 

3.1 Recovery from a Known Inclination Angle 

Distance cues do not indicate which of the layers is the reflected (virtual) one, and 
which is the transmitted (real) one. However, polarization cues give us a simple 
way to achieve both the layer separation and their labeling as real/virtual. At 
the semi-reflecting medium (e.g., a glass window) the reflection coefficients are 
different for each of the light polarization components, and are denoted by $R_\perp$ 
and $R_\parallel$ for the polarization components perpendicular and parallel to the plane 
of incidence, respectively. They can be analytically derived as functions of the 
surface inclination (angle of incidence, $\varphi$) from the Fresnel equations [17,21], 
taking into account the effect of internal reflections in the medium. The 
transmission coefficient of each component is given by 

$$T = 1 - R . \qquad (11)$$

We denote the image due to the real layer (with no window) by $I_T$ and 
the image due to the virtual layer (assuming a perfect mirror replacing the 
window) by $I_R$. The light coming from the real object is superimposed with the 
light coming from the virtual object. Let the scene be imaged through a linear 
polarizer, by which we can select to sense the perpendicular or the parallel 
components ($g_\perp$ and $g_\parallel$, respectively) of the observed light coming from the 
scene. Thus, the two raw images are: 

$$g_\perp = I_R R_\perp/2 + I_T T_\perp/2, \qquad g_\parallel = I_R R_\parallel/2 + I_T T_\parallel/2 , \qquad (12)$$

for initially unpolarized natural light. Solving these equations for the two images 
we obtain the estimated intensities of the layers as a function of an assumed angle 
of incidence, $\varphi$: 



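$$\hat{I}_T(\varphi) = \left[\frac{2R_\perp(\varphi)}{R_\perp(\varphi) - R_\parallel(\varphi)}\right] g_\parallel - \left[\frac{2R_\parallel(\varphi)}{R_\perp(\varphi) - R_\parallel(\varphi)}\right] g_\perp , \qquad (13)$$

$$\hat{I}_R(\varphi) = \left[\frac{2 - 2R_\parallel(\varphi)}{R_\perp(\varphi) - R_\parallel(\varphi)}\right] g_\perp - \left[\frac{2 - 2R_\perp(\varphi)}{R_\perp(\varphi) - R_\parallel(\varphi)}\right] g_\parallel . \qquad (14)$$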



Therefore, the layers are recovered by simple weighted subtractions of the raw 
images. Moreover, the equation for $I_T$ is distinct from the equation for $I_R$, so 
the separate layers are automatically labeled as reflected/transmitted (i.e., 
virtual/real). Note, however, that the weights in these subtractions are functions 
of the angle at which the invisible (but semireflecting) medium is inclined with 
respect to the viewer. 
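
The weighted subtractions can be sketched as follows. The sketch makes several 
assumptions that are not stated in this section: the window is modeled as a 
lossless plate with refractive index n = 1.5, the single-interface Fresnel 
reflectances r are combined with the incoherent internal reflections as 
R = 2r/(1 + r), and the angle is restricted to 0° < φ < 90°. Images are 
plain lists of pixel values and all function names are hypothetical. 

```python
import math


def plate_reflectances(phi_deg, n=1.5):
    """Return (R_perp, R_par) of a lossless plate for 0 < phi_deg < 90."""
    phi = math.radians(phi_deg)
    phi_t = math.asin(math.sin(phi) / n)  # refraction angle from Snell's law
    r_perp = (math.sin(phi - phi_t) / math.sin(phi + phi_t)) ** 2
    r_par = (math.tan(phi - phi_t) / math.tan(phi + phi_t)) ** 2
    # Assumed model for internal reflections inside the plate.
    return 2 * r_perp / (1 + r_perp), 2 * r_par / (1 + r_par)


def simulate_raw_images(i_r, i_t, phi_deg, n=1.5):
    """Raw images g_perp, g_par from Eq. (12), with T = 1 - R from Eq. (11)."""
    r_perp, r_par = plate_reflectances(phi_deg, n)
    g_perp = [(r * r_perp + t * (1 - r_perp)) / 2 for r, t in zip(i_r, i_t)]
    g_par = [(r * r_par + t * (1 - r_par)) / 2 for r, t in zip(i_r, i_t)]
    return g_perp, g_par


def recover_layers(g_perp, g_par, phi_deg, n=1.5):
    """Estimated (I_T, I_R) from Eqs. (13) and (14) for an assumed angle."""
    r_perp, r_par = plate_reflectances(phi_deg, n)
    d = r_perp - r_par
    i_t = [(2 * r_perp * gp - 2 * r_par * gs) / d
           for gs, gp in zip(g_perp, g_par)]
    i_r = [((2 - 2 * r_par) * gs - (2 - 2 * r_perp) * gp) / d
           for gs, gp in zip(g_perp, g_par)]
    return i_t, i_r


reflected = [0.2, 0.8, 0.5]
transmitted = [0.9, 0.1, 0.4]
g_perp, g_par = simulate_raw_images(reflected, transmitted, 27.5)
est_t, est_r = recover_layers(g_perp, g_par, 27.5)
```

With n = 1.5 the parallel reflectance vanishes near 56.3°, in agreement with 
the Brewster angle quoted in the next subsection. 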



3.2 The Inclination of the Invisible Surface 

In case the angle of incidence used in the reconstruction process is not the true 
inclination angle of the surface $\varphi_{true}$, each recovered layer will contain traces of 
the complementary layer (crosstalk). In an experiment we performed [19,21], a 
scene composed of several objects was imaged through an upright glass window. 
The window semi-reflected another scene (virtual object). A linear polarizer 
was rotated in front of the camera between consecutive image acquisitions. The 
reflected layer is attenuated in $g_\parallel$ (Fig. 4), but its disturbance is still significant, 
since $\varphi_{true} = 27.5° \pm 3°$ was far from the Brewster angle 56° (at which the 
reflection disappears in $g_\parallel$). Having assumed that the true angle of incidence is 
Fig. 4. The raw images. [Left]: $g_\perp$. [Right]: Although the reflected component is smaller 
in $g_\parallel$, the image is still unclear. 



unknown, $\varphi$ was guessed. As seen in Fig. 5, when the angle was underestimated, 
negative traces appeared in the reconstruction (bright areas in $I_R$ are darker 
than their surroundings in $I_T$). When the angle was overestimated, the traces 
are positive (bright areas in $I_R$ are brighter than their surroundings in $I_T$). When 
the correct angle is used, the crosstalk is removed, and the labeled layers are well 
separated. 

An automatic way to detect this angle is by seeking the reconstruction that 
minimizes the mutual information between the estimated layers, in a similar 
manner to the procedure of section 2.3: 

$$\hat{\varphi} = \arg\min_{\varphi} I_n[\hat{I}_T(\varphi), \hat{I}_R(\varphi)] . \qquad (15)$$

In this experiment $I_n$ is minimized at $\varphi = 25.5°$. The angle at which the 
estimated layers are decorrelated is $\varphi = 27°$ (Fig. 5). Both these values are within 
the experimental error of the physical measurement of $\varphi_{true}$. Note that the 
correlation sign is consistent with the “positive/negative” traces when the assumed 
angle is wrong. 
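
A scan over candidate angles can be sketched as follows. For brevity the sketch 
uses the decorrelation criterion mentioned above (smallest absolute 
correlation between the estimated layers) as a stand-in for the mutual 
information of Eq. (15). It also assumes a lossless glass plate with n = 1.5 
and the internal-reflection model R = 2r/(1 + r); these modeling choices and 
all function names are assumptions. 

```python
import math


def plate_reflectances(phi_deg, n=1.5):
    """Return (R_perp, R_par) of a lossless plate for 0 < phi_deg < 90."""
    phi = math.radians(phi_deg)
    phi_t = math.asin(math.sin(phi) / n)
    r_perp = (math.sin(phi - phi_t) / math.sin(phi + phi_t)) ** 2
    r_par = (math.tan(phi - phi_t) / math.tan(phi + phi_t)) ** 2
    return 2 * r_perp / (1 + r_perp), 2 * r_par / (1 + r_par)


def recover_layers(g_perp, g_par, phi_deg, n=1.5):
    """Estimated (I_T, I_R) from Eqs. (13) and (14) for an assumed angle."""
    r_perp, r_par = plate_reflectances(phi_deg, n)
    d = r_perp - r_par
    i_t = [(2 * r_perp * gp - 2 * r_par * gs) / d
           for gs, gp in zip(g_perp, g_par)]
    i_r = [((2 - 2 * r_par) * gs - (2 - 2 * r_perp) * gp) / d
           for gs, gp in zip(g_perp, g_par)]
    return i_t, i_r


def correlation(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def estimate_inclination(g_perp, g_par, candidates_deg, n=1.5):
    """Pick the candidate angle whose recovered layers are least correlated."""
    def score(phi_deg):
        i_t, i_r = recover_layers(g_perp, g_par, phi_deg, n)
        return abs(correlation(i_t, i_r))

    return min(candidates_deg, key=score)
```

A typical call would pass the two raw polarization images and a grid such as 
`[15.0 + 0.5 * k for k in range(51)]`. When the assumed angle is too small the 
correlation of the recovered layers is negative, and when it is too large the 
correlation is positive, matching the sign of the traces described above. 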

The reconstruction of the real layer ($I_T$) in the experiment is more sensitive 
to an error in the angle of incidence than the reconstruction of the virtual layer 
($I_R$). In Fig. 5 the contamination in the estimated $I_R$ by $I_T$ is hardly visible, 
if at all. On the other hand, the contamination in the estimated $I_T$ by $I_R$ is 
more apparent. This result is consistent with a theoretical prediction published 
in Ref. [21]. In that work, the relative contamination of each recovered layer by 
its complementary layer was derived. It was normalized by the “signal to noise ratio” 
of the layers when no polarizer is used. We outline here just the final result of 
the first-order derivation from Ref. [21]. Fig. 6 plots (solid line) the first order 
of $c_T$, which is the theoretical relative contamination of the transmitted layer 
by the virtual layer. It is much larger than the relative contamination $c_R$ of the 
recovered reflected layer by the real layer (dashed line). 

According to Fig. 6, at $\varphi_{true} = 27°$, $I_T$ will be ≈ 8% contaminated by $I_R$ per 
1° error in $\varphi$. Thus for the 10° error of Fig. 5 we get ≈ 80% contamination (if 
the first order approximation still holds), when the reflected contribution is as 
bright as the transmitted one. If the reflection is weaker, the contamination will 
Fig. 5. Experimental results. When the assumed angle of incidence is correct 
($\varphi = \varphi_{true} = 27°$), the separation is good. In cases of under-estimation or 
over-estimation of the angle, negative or positive traces of the complementary layer appear, 
respectively. This is also seen in the increase of mutual information and in the 
correlation of the estimated layers. The traces are stronger in $I_T$ than in $I_R$, consistent 
with Fig. 6. 
Fig. 6. The relative contamination of each layer, per 1° of error in the angle of 
incidence, if the reflected contribution is as bright as the transmitted one (after the 
incidence on a glass window). 



be proportionally smaller. On the other hand, $I_R$ will be just ≈ 3% contaminated 
by $I_T$ at 10° error from $\varphi_{true}$, when the reflected contribution is as bright as the 
transmitted one. This is the reason why in this experiment $I_R$ appears to be 
much more robust to the angle error than $I_T$. 

4 Discussion 

This paper concentrated on the analysis of multi-valued images that occur when 
looking through a semi-reflecting surface, such as a glass window. We have shown 
that two raw images suffice to separate the two contributing layers. Polarization 
cues enable the labeling of each of the layers as real or virtual. They also enable 
the extraction of information on the clear semi-reflecting surface itself (its incli- 
nation in space). However, it seems to be very difficult to use this approach if the 
problem is scaled to more than two contributing layers. Another shortcoming of 
this approach is that it is not applicable if the transparent layers do not involve 
reflections (as occur in volumetric specimens). 

The distance cue can easily be scaled for cases where there are more than two 
layers. Actually, it is used in volumetric specimens (which may have a continuum 
of layers), based on the focus cue. Our demonstration is definitely not limited 
to the focus/defocus cue, since defocus blur, motion blur, and stereo disparity 
have similar origins m and differ mainly in the scale and shape of the kernels. 
Therefore, the success of the methods that are based on defocus blur is an 
encouraging step towards understanding and demonstrating the estimation of the 
motion PSFs or stereo disparities in transparent scenes from as few as 2 images, 
and recovering the layers from them. However, if a small baseline suffices to 
separate the layers, then a method based on defocus blur may be preferable since 
depth from defocus is more stable with respect to perturbations and occlusions 
than methods that rely on stereo or motion, for the same physical dimensions 
of the setup PI. On the other hand, methods that rely on distance cues seem 
to have an inherent ill-conditioning at the low frequency components, and a 
labeling ambiguity in cases of semi-reflections. 

In microscopy and in tomography, the suggested method for self calibration 
of the PSF can improve the removal of crosstalk between adjacent slices. How- 
ever, in these cases significant correlation exists between adjacent layers, so the 
correlation criterion may not be adequate. This is therefore a possible direction 
for future research. 

Since methods relying on distance cues somewhat complement methods that 
rely on polarization cues, fusion of these cues for separating semi-reflections is 
a promising research direction. The ability to separate transparent layers can 
be utilized to generate special effects. For example, in Ref. 1 ( )] images were 
rendered with each of the occluding (opaque) layers defocused, moved and en- 
hanced independently. The same effects, and possibly other interesting ones can 
now be generated in scenes containing semireflections. 
Abstract. Towards segmentation from multiple cues, this paper 
demonstrates the combined use of color and symmetry for detecting 
regions of interest (ROI), using the detection of man-made wooden 
objects and the detection of faces as working examples. A func- 
tional that unifies color compatibility and color-symmetry within 
elliptic supports is defined. Using this functional, the ROI detection 
problem becomes a five-dimensional global optimization problem. 
Exhaustive search is inapplicable due to its prohibitive computational 
cost. Genetic search converges rapidly and provides good results. The 
added value obtained by combining color and symmetry is demonstrated. 



1 Introduction 

Most segmentation methods associate a scalar measurement or a vector of mea- 
surements with each pixel, characterizing its grey level, color or the texture to 
which it belongs. Once the measurement space has been defined, segmentation 
can be viewed as an optimization problem: regions should be uniform internally, 
different from adjacent ones and “reasonable” in their number and shapes. For- 
malizing these vague concepts in the form of an objective function and devising 
an efficient way to perform the search are both difficult problems. 

The gestalt school suggested grouping principles that guide human percep- 
tual organization. They include similarity, proximity, continuation, symmetry, 
simplicity and closure. Incorporating the gestalt principles in machine vision is 
an attractive idea |2S1. In particular, the gestalt principles relate to global shape 
properties and represent a-priori visual preferences, issues that a successful seg- 
mentation method should address. Note, however, that studies in human percep- 
tual organization are often limited to binary images, commonly to dot patterns. 
In computer vision, using binarized images requires successful edge detection, 
which is largely equivalent to segmentation, thus relying on the unknown. Ap- 
plication of the gestalt principles to image segmentation is thus desirable, but 
not straightforward. 

Symmetry is one of the gestalt grouping principles. It appears in man made 
objects and is also common in nature [noni- The omnipresence of symmetry 
has motivated many studies on symmetry in images, see e.g., [21,22,26,32]. The 
possibility of rapidly finding symmetric areas in raw gray level images, as shown 
in [22], encourages the use of symmetry as a cue for segmentation. However, 
symmetry is not always maximal where expected. One example in [22] shows 
greater symmetry between a tree and its shadow than the symmetry of the tree 
itself. This indicates that symmetry alone is insufficient and that additional cues 
should be used. Our long term goal is to develop a unified low-level vision module, 
in which several basic visual tasks, each difficult when carried out separately, 
assist one another towards accomplishing their missions. This will simplify vision 
system design, require less top-down feedback and lead to more stable and robust 
performance. 

This paper is a step towards computationally efficient symmetry aided seg- 
mentation. The idea of carrying out segmentation in conjunction with symmetry 
detection is not new in itself [I isp2 1 122|2SI31 j . but the concept is still in its in- 
fancy and much remains to be studied. Our approach is quite general, but is 
presented here in the context of two specific vision tasks: the detection of man- 
made wooden objects, and the well-studied frontal- view face detection task I2ISI 
IE). Starting from a color image, the similarity of each pixel color to wood color 
or to skin color can be quantified. Many algorithms for face detection based on
skin color are available. Face detection using symmetry has
also been considered, e.g. mn\. Our interest is in the added value obtained by
performing segmentation in conjunction with symmetry detection, rather than as 
separate processes. Progress in this direction has recently been described in jn|. 

To maintain the generality of the method presented, we intentionally ignore 
highly specific and application-dependent cues that can be very useful in par- 
ticular applications, such as the position and exact shape of facial features in 
the case of face detection. The algorithm can thus be applied, with minimal 
changes, to other computer vision problems, where roughly symmetric objects, 
characterized by some uniformity in an arbitrary measurement space, have to 
be rapidly detected. 

Let D denote an elliptical domain at any location, orientation and scale
within a color image I. D is thus characterized by five parameters. Let S(D) denote
a measure of the symmetry of the image within D, with respect to the major
axis of D, taking color into account. Let C(D) quantify the color-compatibility
of D, i.e., the dominance of wood-colored (or skin-colored) pixels within D.
Define a functional F(D) = T{S(D), C(D), D} that combines symmetry, color
compatibility and size. The global maximum of F(D) corresponds to some elliptical
domain D* in which the image is highly symmetric and exhibits high color
compatibility. The operational goal is to efficiently find D*.



2 Color Compatibility 

Identifying the “best” color-space for grouping tasks such as skin segmentation is 
still controversial. We carried out a modest performance evaluation, using images 
taken locally and some images from the University of Stirling face database [10].

Fig. 1. Light grey: A scatter-diagram showing the position in the rgb space (normalized
RGB) of the skin pixels in 35 face images. Some of the images were imported from the
Stirling database [10]; a few were taken locally. Black: The clustering of wood colors,
taken from 7 images of objects made of dark and light wood. Dark grey: Positions in
which skin and wood colors overlap.



We compared the following color-spaces: RGB, YES, TSV, rgb (normalized
RGB: NRGB) |^, HSV, XYZ, L*U*V* and xyz | 3 . It turned out that
YES, rgb and TSV were most useful for skin segmentation, rgb yielding the best 
results in our tests. The rgb color space is defined by 

$$ r = \frac{R}{R+G+B}, \qquad g = \frac{G}{R+G+B}, \qquad b = \frac{B}{R+G+B} \tag{1} $$

The transformation from RGB to rgb is nonlinear. All values are normalized by
intensity (R + G + B), and b = 1 - r - g. Note that skin colors typical to people
of different origins, including Asian, African American and Caucasian cluster in 
the rgb color space |E|. Furthermore, the rgb color space is insensitive to surface 
orientation and illumination direction jSj. The light-grey points in Fig. 1 depict
the scatter in the rgb color-space of skin-pixels from 35 face images. Colors of 
different wood types also cluster in the rgb space (black points). Note the overlap 
between skin and wood colors (dark grey) . 
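
The normalization of Eq. (1) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `normalized_rgb` and the handling of black pixels (left at zero) are assumptions, not part of the original method.

```python
import numpy as np

def normalized_rgb(image):
    """Map an H x W x 3 RGB array to normalized (r, g, b) chromaticities, Eq. (1)."""
    image = np.asarray(image, dtype=float)
    intensity = image.sum(axis=-1, keepdims=True)  # R + G + B per pixel
    # Assumption of this sketch: pixels with zero intensity stay at (0, 0, 0).
    return np.divide(image, intensity, out=np.zeros_like(image), where=intensity > 0)
```

Scaling a pixel by a constant factor leaves its (r, g, b) values unchanged, which is the intensity normalization the text refers to.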

Let f(i,j) = [r(i,j), g(i,j)]^T denote the normalized color vector at a specific
pixel (i,j). We wish to obtain c(i,j), a measure of the similarity of f(i,j) to
a given family of colors, e.g., wood colors or skin colors. The method used is 
inspired by a skin detection algorithm presented in Cl , but we use the rgb color- 
space while cn uses YES. More important, in Cl c(i,j) is a binary function, 
classifying pixels as either members or non-members of the family, while here 
intermediate similarity values are accommodated.

Table 1. The mean vector μ and the covariance matrix K for skin and wood.

Family    μ                K
Skin      [oU, 0.31]^T     [  9.00  0.67 ;  0.67  2.51 ] x 10 ”
Wood      [0 ^, 0.35]^T    [ 10.2  -0.6  ; -0.6   2.7  ] x 10 "


The class-conditional probability density function of skin-colored pixels can 
be reasonably modeled by a two dimensional Gaussian m where the mean 
vector μ and the covariance matrix K are estimated from an appropriate training
set. Our small-scale experiments indicate that a 2-D Gaussian model is suitable
for wood-color as well. Equal-height contours of the color-family probability
density function are ellipses in the r-g plane, whose centers are all at μ and
whose principal axes depend on K. Table 1 shows the mean vector μ and the
covariance matrix K obtained for skin and wood from Fig. 1.

The similarity measure c(i,j) is taken as the color-family probability density
of f(i,j), i.e.

$$ c(i,j) = \exp\left\{ -\tfrac{1}{2} \, [f(i,j) - \mu]^T K^{-1} [f(i,j) - \mu] \right\} \tag{2} $$

Visualizations of c(i,j) are shown in Fig. 2. Given an image domain D, its color
compatibility is quantified as

$$ C(D) = \frac{1}{\|D\|} \sum_{(i,j) \in D} c(i,j). \tag{3} $$
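
As an illustration of the similarity measure of Eq. (2), the following sketch evaluates the Gaussian similarity for an array of (r, g) vectors. The function names are hypothetical, and the color compatibility is taken here as the mean of c over the domain, an assumption chosen to agree with the scale invariance of C(D) stated in Section 4. The values of `mu` and `K` must be supplied by the caller; none are taken from Table 1.

```python
import numpy as np

def color_similarity(rg, mu, K):
    """c(i,j) = exp(-0.5 (f - mu)^T K^-1 (f - mu)) for an H x W x 2 array of (r, g)."""
    diff = np.asarray(rg, dtype=float) - np.asarray(mu, dtype=float)
    K_inv = np.linalg.inv(np.asarray(K, dtype=float))
    # Squared Mahalanobis distance of every pixel to the family mean.
    maha = np.einsum('...i,ij,...j->...', diff, K_inv, diff)
    return np.exp(-0.5 * maha)

def color_compatibility(c, mask):
    """Mean similarity over the pixels selected by the boolean mask of D (assumed form)."""
    return float(c[mask].mean())
```

A pixel whose color equals the family mean gets similarity 1, and the similarity decays smoothly with the Mahalanobis distance, so intermediate values are accommodated.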



3 Color Symmetry Measurement 



The mirror-symmetry of a continuous scalar function f(x,y) with respect to the
x-axis can be measured by a reflectional correlation coefficient

$$ S_f = \frac{\iint f(x,y)\, f(x,-y)\, dx\, dy}{\iint f^2(x,y)\, dx\, dy} \tag{4} $$

Note that any function f can be expressed as the sum of a fully symmetric
component f_s and a perfectly anti-symmetric component f_as; it is easy to show
that

$$ S_f = \frac{\|f_s\|^2 - \|f_{as}\|^2}{\|f_s\|^2 + \|f_{as}\|^2}. \tag{5} $$

For non-negative functions f, S_f ∈ [0,1].

Fig. 2. Top: Two color images (grey-level versions shown). Bottom: The corresponding
skin (left) or wood (right) similarity functions c(i,j).
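
The reflectional correlation coefficient of Eq. (4) and its decomposition form, Eq. (5), can be checked numerically on a discrete image. The sketch below assumes the symmetry axis is the horizontal mid-line of the array, so that y -> -y corresponds to reversing the row order; the function names are illustrative.

```python
import numpy as np

def reflectional_symmetry(f):
    """Discrete Eq. (4): correlation of f with its mirror image about the horizontal mid-line."""
    f = np.asarray(f, dtype=float)
    mirrored = f[::-1, :]  # y -> -y when rows index y
    return float((f * mirrored).sum() / (f * f).sum())

def symmetric_parts(f):
    """Split f into its symmetric and anti-symmetric components, f = f_s + f_as."""
    f = np.asarray(f, dtype=float)
    mirrored = f[::-1, :]
    return 0.5 * (f + mirrored), 0.5 * (f - mirrored)
```

For the 2 x 2 array [[1, 2], [3, 4]], both Eq. (4) and Eq. (5) give 22/30, since the cross terms between the symmetric and anti-symmetric components cancel.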



Symmetry measurement with respect to an arbitrary axis t, in a translated 
and rotated coordinate system (t,s), can be implemented by alignment of the 
(t, s) system with the (x, y) system, i.e., translation and rotation of the relevant 
sub-image to a standard position. In this research, local symmetry is measured 
in elliptic domains. This is accomplished by multiplying the relevant sub-image, 
in the standard position, by an elliptical Gaussian window 



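$$ G(x, y, r_x, r_y) = \frac{1}{2\pi r_x r_y} \exp\left\{ -\left( \frac{x^2}{2 r_x^2} + \frac{y^2}{2 r_y^2} \right) \right\} \tag{6} $$

where (r_x, r_y) are referred to as the effective radii of the elliptic support.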

Variations in illumination intensity over the scene distort symmetry measure- 
ments based on grey levels. This phenomenon must be taken into consideration 
in face image analysis, since most face images are taken indoors, with great spa- 
tial variability in the illumination intensity. One novel aspect of this research 
is symmetry measurement of color images. By using the rgb (normalized RGB) 
color-space, the symmetry measured is that of a vector field of normalized color 
components. This compensates for spatial intensity variations.

The reflectional color symmetry measure of a color image f in a domain D is
defined as

$$ S(D) = \frac{\|r\|^2 S_r(D) + \|g\|^2 S_g(D) + \|b\|^2 S_b(D)}{\|r\|^2 + \|g\|^2 + \|b\|^2} \tag{7} $$

i.e., the weighted average of the 2-D symmetry values measured in the r, g and
b normalized color components with respect to the energy in each component
(within D).

Given the image f(i,j), the measure S(D) of color symmetry in an elliptical
domain D is a function of five parameters: the center coordinates of D, the
effective radii corresponding to D and the orientation of D, i.e., the angle between
its major axis and the x-axis. Observe that S(D) is in itself scale-invariant,
reflecting the fact that magnification has no effect on symmetry.
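
The energy-weighted average of the per-channel symmetries can be sketched as follows. This is a simplified illustration: it assumes the sub-image has already been brought to the standard position (symmetry axis along the horizontal mid-line) and omits the elliptical Gaussian window; the function name is hypothetical.

```python
import numpy as np

def color_symmetry(rgb_norm):
    """Energy-weighted mean of the per-channel reflectional symmetries.

    rgb_norm: H x W x 3 array of normalized color components in standard position.
    """
    rgb_norm = np.asarray(rgb_norm, dtype=float)
    mirrored = rgb_norm[::-1, :, :]
    # ||c||^2 * S_c equals the correlation of channel c with its mirror image,
    # so summing the correlations avoids dividing by a zero-energy channel.
    correlation = (rgb_norm * mirrored).sum()
    energy = (rgb_norm * rgb_norm).sum()
    return float(correlation / energy)
```

Because each weight is the energy of the corresponding channel, a channel that carries little signal contributes little to the overall symmetry value.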

4 Objective Function 

As defined, the measures of color compatibility C(D) and color symmetry S(D)
within an elliptic domain D are scale invariant. Therefore, the symmetry and
color compatibility associated with a tiny symmetric area of the right color will
be higher than in a larger support, in which both symmetry and color compatibility
are not as perfect. This is an undesirable state of affairs, since in any
relevant image one may find many unimportant tiny symmetric regions of the
right color, and at the limit each single pixel is perfectly symmetric and uniform
in color. Regions of interest for image segmentation are usually much larger.
Scale dependence should therefore be built into the objective function. The objective
function is also the point of choice for imposing application-specific a-priori
knowledge and preferences, possibly expressed via a function P(D). Thus,
the objective function takes the general form F(D) = T{S(D), C(D), D, P(D)}.
The objective function used in this research is of the form

$$ F(D) = S^k(D) \cdot C^l(D) \cdot \|D\| \cdot P(D) \tag{8} $$

where k and l are positive integers. P(D) was used, for example, to limit the ratio
between the length of the major and minor axes in an elliptic approximation of
a human face.
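
A minimal sketch of an objective of this product form is given below. The default exponents, the hard aspect-ratio prior and its threshold are hypothetical choices made for illustration; the text only states that k and l are positive integers and that P(D) limited the axis ratio.

```python
def aspect_ratio_prior(r_major, r_minor, max_ratio=2.0):
    """Hypothetical P(D): 1 if the axis ratio is acceptable, 0 otherwise."""
    return 1.0 if r_major / r_minor <= max_ratio else 0.0

def objective(symmetry, compatibility, area, prior=1.0, k=1, l=1):
    """F(D) = S(D)^k * C(D)^l * ||D|| * P(D); k and l are assumed positive integers."""
    return (symmetry ** k) * (compatibility ** l) * area * prior
```

With the size factor included, a single perfectly symmetric pixel (S = C = 1, area 1) scores far lower than a large, slightly imperfect region, which is the scale dependence the section asks for.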

5 Global Optimization 

Given an image, the objective function is a highly complex, multimodal function 
of five parameters: the coordinates of the centroid of the supporting region, 
its effective radii and its orientation. Maximizing the objective function is an 
elaborate global optimization problem. Solving this problem by exhaustive search 
is computationally prohibitive. 

Fig. 3. The suggested algorithm applied as a face detector. Note that, to maintain
generality, the method relies only on symmetry and color-fitness; facial features are
not used. Grey level versions of the actual color images are shown. Top-left: By measuring
symmetry in the normalized color channels, large illumination variations can be
accommodated. Top-right: Since facial features are not used, glasses pose no difficulty
other than a slight reduction in color compatibility. Middle-left: Added value provided
by the color cue. Middle-right: Added value brought by the symmetry cue. Bottom-left:
Asymmetric background. Bottom-right: Cluttered background.

Fig. 4. The suggested algorithm as a detector of man-made wooden objects. Grey level
versions of the actual color images are presented.

Fig. 5. Limitations of the suggested algorithm. Grey level versions of the actual color
images are presented. Left: Convergence to a local maximum. Right: Perspective projection
leads to skewed symmetry. The chess board is an extreme case in which skewing
turns symmetry to anti-symmetry. In this example, parts of the chessboard are sufficiently
symmetric locally, but (the image of) the whole chessboard is not symmetric.
Accommodating skewed symmetry in the method is straightforward, but would require
higher dimensional search.

We implemented both a conventional genetic search algorithm m and a
variation of the probabilistic genetic algorithm (PGA). Both
algorithms perform quite well. The PGA tends to be faster than the standard genetic
algorithm in the initial stages, but its final convergence seems to be slower.
Typically, only about 3000 evaluations of the objective function are needed to
reach the global maximum with either algorithm. Considering that 32 bits are
used to describe each hypothesis (7 bits for each of the location parameters
x, y and 6 bits for each of the other three parameters), computing time is reduced
by 6 orders of magnitude with respect to exhaustive search. The images in Figs. 3 and 4
demonstrate the performance of the algorithm. Some of its limitations are shown in Fig. 5.
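
The quoted saving follows from simple arithmetic on the hypothesis encoding, which the short sketch below reproduces. The variable names are illustrative; the bit counts and the figure of about 3000 evaluations are taken from the text.

```python
import math

# Bits per hypothesis: 7 for each of x and y, 6 for each of the two radii and the orientation.
bits = 2 * 7 + 3 * 6
search_space = 2 ** bits          # number of distinct hypotheses for exhaustive search
evaluations = 3000                # typical number of objective evaluations reported
speedup = search_space / evaluations
orders_of_magnitude = math.log10(speedup)
```

The ratio of the full search space to the number of evaluations is slightly above one million, i.e. about six orders of magnitude.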

6 Discussion 

Towards segmentation from multiple cues, this paper demonstrates the combined 
use of symmetry (global feature) and color (local property) for detecting regions 
of interest. Frontal-view face detection and the detection of man-made wooden 
objects have been used as working examples, but generality has been carefully 
maintained and the approach is not limited to those applications. One novel 
aspect of the suggested approach is the measurement of chromatic symmetry, 
thus compensating for illumination variations in the scene. Note that, unlike 
previous works, symmetry is analyzed in conjunction with skin detection rather 
than sequentially. The integration of the symmetry and color cues takes place in 
a unified objective function. 

Great computational savings are obtained by avoiding exhaustive search. 
The global optimization method used can be extended 1 1.^1 to detect more 
than one region of interest in the image. Further computational gains can be 
achieved by caching values of the objective function, thus avoiding unnecessary 
recomputation. An interesting direction for future research is the extraction of
smoothness properties of the objective function. These could lead to global optimization
with guaranteed convergence m.

Abstract. The paper concerns 2D-3D pose estimation in the algebraic
language of kinematics. The pose estimation problem is modeled on the
basis of several geometric constraint equations. In that way the projective
geometric aspect of the topic is implicitly represented and thus, pose
estimation is a pure kinematic problem. The authors propose the use
of motor algebra to model screw displacements of lines or the use of
rotor algebra to model the motion of points. Instead of using matrix
based LMS optimization, the development of special extended Kalman
filters is proposed. In this paper extended Kalman filters for estimating
rotation and translation of several constraints in terms of rotors and
motors will be presented. The experiments aim to compare the use of
different constraints and different methods of optimally estimating the
pose parameters.



1 Introduction 

The paper describes the estimation of pose parameters of known rigid objects 
in the framework of kinematics. The aim is to experimentally verify advantages 
of extended Kalman filter approaches versus linear least squares optimizations. 
Pose estimation in the framework of kinematics will be treated as nonlinear opti- 
mization with respect to geometric constraint equations expressing the relations 
between 2D image features and 3D model data. 

Pose estimation is a basic visual task. Although its importance has been
recognized for a long time (see e.g. Grimson jS|), and although an overwhelming
number of papers has been published on that topic 0, up to now
there is no unique and general solution of the problem. In a general sense, pose
estimation can be classified into three categories: 2D-2D, 3D-3D, and 2D-3D. In 
the first and second category, both the measurement data and the model data 
are 2D or 3D, respectively. In the third category fall those experiments where 
measurement data are 2D and model data are 3D. This is the situation we will 
assume. 

An often made assumption is that of rigidity of objects. The well-known kinematic
model of rigid body transformation is a natural one. It consists of rotation
and translation. On the other hand, the visual data result from perspective pro- 
jection, which normally can be modeled using a pinhole camera model. 

R. Klette et al. (Eds.): Multi-Image Analysis, LNCS 2032, pp. 153-^^^ 2001. 

© Springer-Verlag Berlin Heidelberg 2001



154 G. Sommer, B. Rosenhahn, and Y. Zhang 



In this paper we treat pose estimation, which is related to motion estimation,
as a problem of kinematics. The problem can be linearly represented in
motor algebra 0 or dual quaternion algebra [B|. We use implicit formulations
of geometry as geometric constraints. We will demonstrate that geometric
constraints are well conditioned and thus behave more robustly in case of noisy
data.

Pose estimation is an optimization problem, formulated in either a linear or a
nonlinear manner, and as either a constrained or an unconstrained technique. In case of
noisy data, which is the standard case in practice, nonlinear optimization techniques
are preferred |H|. We will use extended Kalman filters because of their
incremental, real-time potential for estimation. In that respect it will be of interest
that the estimation error of the fulfillment of the considered geometric
constraints keeps a natural distance measure of the considered entities to the actual
object frame. Thus, EKF based estimation of geometric constraints permits
a progressive scheme of pose estimation.

The paper is organized as follows. In section two we will introduce the mo- 
tor algebra as representation frame for either geometric entities, geometric con- 
straints, or Euclidean transformations. In section three we introduce the geo- 
metric constraints and their changes in an observation scenario. Section four 
is dedicated to the geometric analysis of these constraints. In section five we 
will present the EKF approaches for estimating the constraints. In section six 
we compare the performance of different algorithms for constraint based pose 
estimation. 



2 The Algebraic Frame of Kinematics 

In our comparative study we will consider the problem of pose estimation as a 
kinematic one. In this section we want to sketch the modeling of rigid body mo- 
tions in the framework of motor algebra, a special degenerate geometric algebra 
with remarkable advantages. 



2.1 The Motor Algebra as Degenerate Geometric Algebra 

We introduce the motor algebra as the adequate frame to represent screw trans- 
formations in line geometry 0. This algebra belongs to the family of geometric 
algebras, a variant of Clifford algebras in which the geometric interpretation of 
operations is dominantly considered Em. 

A geometric algebra $G_{p,q,r}$ is a linear space of dimension $2^n$, $n = p + q + r$,
with a rich subspace structure, called blades, to represent so-called multivectors
as higher order algebraic entities in comparison to vectors of a vector space
as first order entities. A geometric algebra $G_{p,q,r}$ results in a constructive way
from a vector space $\mathbb{R}^n$, endowed with the signature $(p,q,r)$, $n = p + q + r$, by
application of a geometric product. The geometric product consists of an outer
($\wedge$) and an inner ($\cdot$) product, whose role is to increase or to decrease the order
of the algebraic entities, respectively.

To make this concrete, the motor algebra is the 8D even subalgebra $G^+_{3,0,1}$ of $G_{3,0,1}$,
derived from $\mathbb{R}^4$, i.e. $n = 4$, $p = 3$, $q = 0$, $r = 1$, with basis vectors $\gamma_k$, $k = 1, \dots, 4$,
and the property $\gamma_1^2 = \gamma_2^2 = \gamma_3^2 = +1$ and $\gamma_4^2 = 0$. Because $\gamma_4^2 = 0$, $G_{3,0,1}$ is
called a degenerate algebra. The motor algebra $G^+_{3,0,1}$ is of dimension eight and
spanned by qualitatively different subspaces with the following basis multivectors:

one scalar: $1$
six bivectors: $\gamma_2\gamma_3,\ \gamma_3\gamma_1,\ \gamma_1\gamma_2,\ \gamma_4\gamma_1,\ \gamma_4\gamma_2,\ \gamma_4\gamma_3$
one pseudoscalar: $I = \gamma_1\gamma_2\gamma_3\gamma_4$.

Because $\gamma_4^2 = 0$, also the unit pseudoscalar squares to zero, i.e. $I^2 = 0$. Remembering
that the hypercomplex algebra of quaternions $\mathbb{H}$ represents a 4D
linear space with one scalar and three vector components, it can simply be verified
that $G^+_{3,0,1}$ is isomorphic to the algebra of dual quaternions $\widehat{\mathbb{H}}$,
$G^+_{3,0,1} \simeq \widehat{\mathbb{H}}$. Each dual quaternion $\hat{q} \in \widehat{\mathbb{H}}$ is related
with quaternions $q_r, q_d \in \mathbb{H}$ by $\hat{q} = q_r + I q_d$.
It is obvious from that isomorphism that also quaternions have a representation
in geometric algebra, just as complex and real numbers have. Quaternions correspond
to the 4D even subalgebra $G^+_{3,0,0}$ of $G_{3,0,0}$, derived from $\mathbb{R}^3$. They have the
basis $\{1, \gamma_2\gamma_3, \gamma_3\gamma_1, \gamma_1\gamma_2\}$. The advantage of using geometric algebra instead of
diverse hypercomplex algebras is the generality of its construction and, derived
from that, the existence of algebraic entities with unique interpretation whatever
the dimension of the original vector space.

More important is to remark that the bivector basis of motor algebra constitutes
the basis for line geometry using Plücker coordinates. Therefore, motor
algebra is extraordinarily useful to represent line based approaches of kinematics,
also in computer vision.

The motor algebra is spanned by bivectors and scalars. Therefore, we will
restrict our scope to that case. Let $A, B, C \in \langle G^+_{3,0,1} \rangle_2$ be bivectors and
$\alpha, \beta \in \langle G^+_{3,0,1} \rangle_0$ scalars. Then the geometric product of bivectors $A, B \in \langle G^+_{3,0,1} \rangle_2$,
$AB$, splits into $AB = A \cdot B + A \times B + A \wedge B$, where $A \cdot B$ is the inner product,
which results in a scalar $A \cdot B = \alpha$, $A \wedge B$ is the outer product, which in this case
results in a pseudoscalar $A \wedge B = I\beta$, and $A \times B$ is the commutator product,
which results in a bivector $C$, $A \times B = \frac{1}{2}(AB - BA) = C$. Changing the sign
of the scalar and bivector in the real and the dual parts of the motor leads to
the following variants of a motor

$$ M = (a_0 + a) + I(b_0 + b), \qquad \widetilde{M} = (a_0 - a) + I(b_0 - b), $$

$$ \bar{M} = (a_0 + a) - I(b_0 + b), \qquad \bar{\widetilde{M}} = (a_0 - a) - I(b_0 - b). $$

These versions will be used to model the motion of points, lines and planes.

2.2 Rotors, Translators, and Motors 

In a general sense, all entities existing in motor algebra are called motors.
Thus, any geometric entity, such as a point, a line, or a plane, has a motor
representation. We will use the term motor in a more restricted sense to denote a
screw transformation, that is, a Euclidean transformation embedded in motor
algebra. Its constituents are rotation and translation. In line geometry we represent
rotation by a rotation line axis and a rotation angle. The corresponding
entity is called a unit rotor, R, and reads as follows

$$ R = r_0 + r_1\gamma_2\gamma_3 + r_2\gamma_3\gamma_1 + r_3\gamma_1\gamma_2 = \cos\left(\frac{\theta}{2}\right) + \sin\left(\frac{\theta}{2}\right) n. $$

Here θ is the rotation angle and n is the unit orientation vector of the rotation
axis in bivector representation, spanned by the bivector basis. A unit rotor is in
geometric algebra a general entity with a spinor structure, representing rotation
in terms of a specified plane. It exists in any dimension and it works for all types
of geometric objects, in contrast to rotation matrices, complex numbers or
quaternions. Its very nature is that it is composed of bivectors B and that there
is an exponential form $R = \pm\exp\left(\frac{1}{2}B\right)$.

If, on the other hand, $t = t_1\gamma_2\gamma_3 + t_2\gamma_3\gamma_1 + t_3\gamma_1\gamma_2$ is a translation vector in
bivector representation, it will be represented in motor algebra as the dual part
of a motor, called translator T, with

$$ T = 1 + I\frac{t}{2} = \exp\left(\frac{t}{2} I\right). $$

Thus, a translator is also a special kind of rotor.

Because rotation and translation concatenate multiplicatively in motor algebra,
a motor M reads

$$ M = TR = R + I\frac{t}{2}R = R + IR'. $$

A motor represents a line transformation as a screw transformation. The line L
will be transformed to the line L' by means of a rotation $R_s$ around the line $L_s$ by
the angle θ, followed by a translation $t_s$ parallel to $L_s$. The screw motion equation
as motor transformation reads 0 . 0

$$ L' = T_s R_s L \widetilde{R}_s \widetilde{T}_s = M L \widetilde{M}. $$



2.3 Motion of Points, Lines, and Planes in Motor Algebra 

First, we will introduce the description of the important geometric entities [7].

A point $x \in \mathbb{R}^3$, represented in the bivector basis of $G^+_{3,0,1}$,
reads $X = 1 + x_1\gamma_4\gamma_1 + x_2\gamma_4\gamma_2 + x_3\gamma_4\gamma_3 = 1 + Ix$.

A line $L \in G^+_{3,0,1}$ is represented by $L = n + Im$ with the line direction
$n = n_1\gamma_2\gamma_3 + n_2\gamma_3\gamma_1 + n_3\gamma_1\gamma_2$ and the moment $m = m_1\gamma_2\gamma_3 + m_2\gamma_3\gamma_1 + m_3\gamma_1\gamma_2$.

A plane $P \in G^+_{3,0,1}$ will be defined by its normal p as bivector and by its
Hesse distance to the origin, expressed as the scalar $d = (x \cdot p)$, in the following
way, $P = p + Id$.

Note that the fact of using line geometry does not prevent us from defining points
and planes, just as in point geometry the other entities also are well defined. In
case of screw motions $M = T_s R_s$ not only line transformations can be modeled,
but also point and plane transformations. These are expressed as follows.

$$ X' = 1 + Ix' = M X \bar{\widetilde{M}} = M(1 + Ix)\bar{\widetilde{M}} = 1 + I(R_s x \widetilde{R}_s + t_s) $$

$$ L' = n' + Im' = M L \widetilde{M} = R_s n \widetilde{R}_s + I(R_s n \widetilde{R}'_s + R'_s n \widetilde{R}_s + R_s m \widetilde{R}_s) $$

$$ P' = p' + Id' = M P \bar{\widetilde{M}} = M(p + Id)\bar{\widetilde{M}} = R_s p \widetilde{R}_s + I((R_s p \widetilde{R}_s) \cdot t_s + d). $$

We will use in this study only point and line transformations because points and
lines are the entities of our object models.
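
The point transformation can be checked numerically through the isomorphism with dual quaternions. The sketch below is an illustration under stated assumptions, not the authors' implementation: quaternions are stored as (w, x, y, z) arrays with the Hamilton product, a dual quaternion is a pair (real part, dual part), and all function names are hypothetical. The right-hand factor reverses the motor and flips the sign of its dual part.

```python
import numpy as np

def qmul(p, q):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    pw, pv, qw, qv = p[0], p[1:], q[0], q[1:]
    return np.concatenate(([pw * qw - pv @ qv], pw * qv + qw * pv + np.cross(pv, qv)))

def qconj(q):
    """Quaternion conjugate, playing the role of the reverse of a rotor."""
    return np.concatenate(([q[0]], -q[1:]))

def dqmul(a, b):
    """(a0 + I a1)(b0 + I b1) = a0 b0 + I (a0 b1 + a1 b0), because I squares to zero."""
    return qmul(a[0], b[0]), qmul(a[0], b[1]) + qmul(a[1], b[0])

def make_motor(axis, theta, t):
    """Motor M = T R with R = cos(theta/2) + sin(theta/2) n and T = 1 + I t/2."""
    n = np.asarray(axis, dtype=float)
    n = n / np.linalg.norm(n)
    R = np.concatenate(([np.cos(theta / 2)], np.sin(theta / 2) * n))
    T = (np.array([1.0, 0.0, 0.0, 0.0]),
         np.concatenate(([0.0], 0.5 * np.asarray(t, dtype=float))))
    return dqmul(T, (R, np.zeros(4)))

def move_point(M, x):
    """Return x' from X' = M X M*, where X = 1 + I x and M* = conj(real) - I conj(dual)."""
    X = (np.array([1.0, 0.0, 0.0, 0.0]),
         np.concatenate(([0.0], np.asarray(x, dtype=float))))
    M_star = (qconj(M[0]), -qconj(M[1]))
    _, dual = dqmul(dqmul(M, X), M_star)
    return dual[1:]
```

For example, a rotation by 90 degrees about the z-axis followed by a translation of 2 along z maps the point (1, 0, 0) to (0, 1, 2), in agreement with x' = R x R~ + t.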

3 Geometric Constraints and Pose Estimation 

First, we make the following assumptions. The model of an object is given by 
points and lines in the 3D space. Furthermore we extract line subspaces or points 
in an image of a calibrated camera and match them with the model of the object. 
The aim is to find the pose of the object from observations of points and lines 
in the images at different poses. Figure 1 shows the scenario with respect to
observed line subspaces. 

We want to estimate the rotation and the translation parameters which lead 
to the best fit of the model with the extracted line subspaces or points. To 
estimate the transformations, it is necessary to relate the observed lines in the 
image to the unknown pose of the object using geometric constraints. 

The key idea is that the observed 2D entities together with their corresponding
3D entities are constrained to lie on other, higher order entities which result
from the perspective projection. In our considered scenario there are three constraints
which are attributed to two classes of constraints:

1. Collinearity: A 3D point has to lie on a line (i.e. a projection ray) in the 
space 

2. Coplanarity: A 3D point or a 3D line has to lie on a plane (i.e. a projection 
plane) . 

With the terms projection ray or projection plane, respectively, we mean the 
image-forming ray which relates a 3D point with the projection center or the in- 
finite set of image-forming rays which relates all 3D points belonging to a 3D line 
with the projection center, respectively. Thus, by introducing these two entities, 
we implicitly represent a perspective projection without necessarily formulating 
it explicitly. Instead, the pose problem is in that framework a purely kinematic 
problem. A similar approach of avoiding perspective projection equations by 
using constraint observations of lines has been proposed in |2t3ll . 

In the scenario of figure 1 we describe the following situation: We assume
3D points $Y_i$ and lines $S_i$ of an object or reference model. Further we extract
points and lines in an image of a calibrated camera and match them with the
model.

Fig. 1. The scenario. The solid lines at the left hand describe the assumptions: the
camera model, the model of the object and the initially extracted lines on the image
plane. The dashed lines at the right hand describe the actual pose of the model.

Table 1. The geometric constraints in motor algebra and dual quaternion algebra.

constraint    entities                                 dual quaternion algebra   motor algebra
point-line    point X = 1 + Ix, line L = n + Im        LX - XL = 0               XL - LX = 0
point-plane   point X = 1 + Ix, plane P = p + Id       PX - XP = 0               PX - XP = 0
line-plane    line L = n + Im, plane P = p + Id        LP - PL = 0               LP + PL = 0

Table 1 gives an overview on the formulations of these constraints in motor
algebra, taken from Blaschke 0, who used expressions in dual quaternion algebra.
Here we adopt the terms from section 2. The meaning of the constraint
equations is immediately clear. They represent the ideal situation, e.g. achieved
as the result of the pose estimation procedure with respect to the observation
frame. With respect to the previous reference frame these constraints read

$$ (M Y \bar{\widetilde{M}}) L - L (M Y \bar{\widetilde{M}}) = 0 $$

$$ P (M Y \bar{\widetilde{M}}) - (M Y \bar{\widetilde{M}}) P = 0 $$

$$ (M S \widetilde{M}) P + P (M S \widetilde{M}) = 0. $$

These compact equations subsume the pose estimation problem at hand: find
the best motor M which satisfies the constraint. With respect to the observer
frame those entities are variables of the measurement model of the extended
Kalman filter on which the motors act.

4 Analysis of Constraints

In this section we will analyze the geometry of the constraints introduced in
the last section in motor algebra. We want to show that the relations between
different entities are controlled by their orthogonal distance, the Hesse distance.
This intuitive result is not only of importance for formulating a mean square
minimization method for finding the best motor satisfying the constraints; in
case of noisy data, the error of that task can also be immediately interpreted as
that Hesse distance.
4.1 Point-Line Constraint 

Evaluating the constraint of a point X = 1 + Ix collinear to a line L = n + Im
leads to

$$ 0 = XL - LX = I(m - n \times x). $$

Since $I \neq 0$, although $I^2 = 0$, the aim is to analyze the bivector $m - n \times x$.
Suppose $X \notin L$. Then, nonetheless, there exists a decomposition $x = x_1 + x_2$
with $X_1 = (1 + Ix_1) \in L$ and $X_2 = (1 + Ix_2) \perp L$. Figure 2 shows the scenario.
Then we can calculate

$$ \|m - n \times x\| = \|m - n \times (x_1 + x_2)\| = \|-n \times x_2\| = \|x_2\|. $$

Thus, satisfying the point-line constraint means to equate the bivectors m and
n × x, or, equivalently, to make the Hesse distance $\|x_2\|$ of the point X to the line
L zero.
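
In ordinary vector notation the same quantity can be evaluated with Plücker coordinates. The sketch below is illustrative: it assumes a unit direction n and the moment convention m = n × v used in the text, with v any point on the line, and the function names are hypothetical.

```python
import numpy as np

def line_from_point_and_direction(v, direction):
    """Return (n, m) with unit direction n and moment m = n x v for a line through v."""
    n = np.asarray(direction, dtype=float)
    n = n / np.linalg.norm(n)
    return n, np.cross(n, np.asarray(v, dtype=float))

def point_line_residual(x, n, m):
    """Residual m - n x x; its norm is the orthogonal distance of x to the line."""
    return m - np.cross(n, np.asarray(x, dtype=float))
```

For the line through (1, 0, 0) parallel to the z-axis, the point (4, 4, 2) gives a residual of norm 5, which is its orthogonal distance to the line, while any point on the line gives a zero residual.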




Fig. 2. The line L consists of the direction n and the moment m = n × v. Further, there
exists a decomposition $x = x_1 + x_2$ with $X_1 = (1 + Ix_1) \in L$ and $X_2 = (1 + Ix_2) \perp L$,
so that $m = n \times v = n \times x_1$.



4.2 Point-Plane Constraint 

Evaluating the constraint of a point X = 1 + Ix coplanar to a plane P = p + Id
leads to

$$ 0 = PX - XP = I(2d + px + xp) = I(d + p \cdot x). $$

Since $I \neq 0$, although $I^2 = 0$, the aim is to analyze the scalar $d + p \cdot x$. Suppose
$X \notin P$. The value d can be interpreted as a sum so that $d = d_{01} + d_{02}$ and $d_{01}p$
is the orthogonal projection of x onto p. Figure 3 shows the scenario. Then we
can calculate

$$ d + p \cdot x = d_{01} + d_{02} + p \cdot x = d_{01} + p \cdot x + d_{02} = d_{02}. $$

The value of the expression $d + p \cdot x$ corresponds to the Hesse distance of the
point X to the plane P.

Fig. 3. The value d can be interpreted as a sum $d = d_{01} + d_{02}$ so that $d_{01}p$ corresponds
to the orthogonal projection of x onto p. That is, $d_{01} = -p \cdot x$.
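
In ordinary vector notation the point-plane residual is a one-line computation. The sketch below assumes a unit normal p and relies on the fact that the inner product of two bivectors of the motor algebra equals minus the Euclidean dot product of their coefficient vectors (the basis bivectors square to -1), so that d + p · x in the text becomes d minus the dot product here. The function name is hypothetical.

```python
import numpy as np

def point_plane_residual(x, p, d):
    """Signed distance term d - <p, x> for a plane with unit normal p and Hesse distance d."""
    return d - float(np.dot(np.asarray(p, dtype=float), np.asarray(x, dtype=float)))
```

For the plane z = 3 (p = (0, 0, 1), d = 3), a point on the plane gives zero, and the point (0, 0, 5) gives a residual of magnitude 2, its distance to the plane.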

4.3 Line-Plane Constraint 

Evaluating the constraint of a line L = n + Im coplanar to a plane P = p + Id
leads to

$$ 0 = LP + PL = np + pn + I(2dn - pm + mp) = n \cdot p + I(dn - p \times m). $$

Thus, the constraint can be partitioned in one constraint on the real part of the
motor and one constraint on the dual part of the motor. The aim is to analyze
the scalar $n \cdot p$ and the bivector $dn - (p \times m)$ independently. Suppose $L \notin P$.
If $n \not\perp p$, the real part leads to

$$ n \cdot p = -\|n\| \, \|p\| \cos(\alpha) = -\cos(\alpha), $$

where α is the angle between L and P, see figure 4. If $n \perp p$, we have $n \cdot p = 0$.

Since the direction of the line is independent of the translation of the rigid
body motion, the constraint on the real part can be used to generate equations
with the parameters of the rotation as the only unknowns. The constraint on
the dual part can then be used to determine the unknown translation. In other
words, since the motor to be estimated, M = R + IRT = R + IR', is determined
in its real part only by rotation, the real part of the constraint allows us to estimate
the rotor R, while the dual part of the constraint allows us to estimate the rotor
R'. So it is possible to sequentially separate equations on the unknown rotation
from equations on the unknown translation without the limitations known from
the embedding of the problem in Euclidean space 0. This is very useful, since
the two smaller equation systems are easier to solve than one larger equation
system.

To analyze the dual part of the constraint, we interpret the moment $m$ of
the line representation $L = n + Im$ as $m = n \times s$ and choose a vector $s$ with
$S = (1 + Is) \in L$ and $s \perp n$. By expressing the inner product as the anti-commutator
product, it can be shown that $-(p \times m) = (s \cdot p)n - (n \cdot p)s$.
Now we can evaluate

$$dn - (p \times m) = dn - (n \cdot p)s + (s \cdot p)n.$$

Figure 4 shows the scenario. Further, we can find a vector $s_1$ with $s \parallel s_1$, so




Fig. 4. The plane $P$ consists of its normal $p$ and the Hesse distance $d$. Furthermore
we choose $S = (1 + Is) \in L$ with $s \perp n$.



that

$$0 = d - (\|s\| + \|s_1\|)\cos(\beta).$$

The vector $s_1$ might also be antiparallel to $s$. This leads to a change of the sign,
but does not affect the constraint itself. Now we can evaluate

$$dn - (n \cdot p)s + (s \cdot p)n = dn - \|s\|\cos(\beta)\,n + \cos(\alpha)\,s = \|s_1\|\cos(\beta)\,n + \cos(\alpha)\,s.$$

The error of the dual part consists of the vector $s$ scaled by the cosine of the angle $\alpha$ and the
direction $n$ scaled by the norm of $s_1$ and the cosine of the angle $\beta$.

If $n \perp p$, we will find

$$\|dn - (p \times m)\| = \|dn + (s \cdot p)n - (n \cdot p)s\| = |d + s \cdot p|.$$

This means, in agreement with the point-plane constraint, that $(d + s \cdot p)$ describes
the Hesse distance of the line to the plane.
This analysis shows that the considered constraints are not only qualitative
constraints, but also quantitative ones. This is very important, since we want to
measure the extent to which these constraints are fulfilled in the case of noisy data.

5 The Extended Kalman Filter for Pose Estimation 

In this section we want to present the design of EKFs for estimating the pose 
based on three constraints. Because an EKF is defined in the frame of linear 
vector algebra, it will be necessary to map the estimation task from any chosen 
algebraic embedding to linear vector algebra (see e.g. |2|). 

5.1 EKF Pose Estimation Based on Point-Line Constraint 

In case of point based measurements of the object at different poses, an algebraic
embedding of the problem in the 4D linear space of the algebra of rotors $\mathcal{G}^+_{3,0,0}$,
which is isomorphic to that of the quaternions $\mathbb{H}$, will be sufficient |7IH] . Thus,
rotation will be represented by a unit rotor $R$ and translation will be a bivector
$t$. A point $y_i$ transformed to $x_i$ reads $x_i = R y_i \tilde{R} + t$. We denote the four
components of the rotor as

$$R = r_0 + r_1\sigma_2\sigma_3 + r_2\sigma_3\sigma_1 + r_3\sigma_1\sigma_2.$$

To convert a rotor $R$ into a rotation matrix $\mathcal{R}$, simple conversion rules are at
hand:

$$\mathcal{R} = \begin{pmatrix}
r_0^2 + r_1^2 - r_2^2 - r_3^2 & 2(r_1 r_2 + r_0 r_3) & 2(r_1 r_3 - r_0 r_2) \\
2(r_1 r_2 - r_0 r_3) & r_0^2 - r_1^2 + r_2^2 - r_3^2 & 2(r_2 r_3 + r_0 r_1) \\
2(r_1 r_3 + r_0 r_2) & 2(r_2 r_3 - r_0 r_1) & r_0^2 - r_1^2 - r_2^2 + r_3^2
\end{pmatrix}.$$

In vector algebra, the above point transformation model can be described as

$$x_i = \mathcal{R} y_i + t.$$

The projection ray in the point-line equation is represented by Plücker coordinates
$(n_i, m_i)$, where $n_i$ is its unit direction and $m_i$ its moment. The
point-line constraint equation in vector algebra of $\mathbb{R}^3$ reads

$$f_1 = m_i - n_i \times x_i = m_i - n_i \times (\mathcal{R} y_i + t) = 0.$$
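The conversion rule and the point-line constraint can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper names (`rotor_to_matrix`, `point_line_residual`) and the example pose are hypothetical, and the matrix follows the entry pattern given above.

```python
import math


def rotor_to_matrix(r0, r1, r2, r3):
    """Rotation matrix built from the four rotor components (pattern as in the text)."""
    return [
        [r0*r0 + r1*r1 - r2*r2 - r3*r3, 2*(r1*r2 + r0*r3), 2*(r1*r3 - r0*r2)],
        [2*(r1*r2 - r0*r3), r0*r0 - r1*r1 + r2*r2 - r3*r3, 2*(r2*r3 + r0*r1)],
        [2*(r1*r3 + r0*r2), 2*(r2*r3 - r0*r1), r0*r0 - r1*r1 - r2*r2 + r3*r3],
    ]


def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]


def cross(a, b):
    return [a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0]]


def point_line_residual(n, m, rotor, t, y):
    """f_1 = m - n x (R y + t); zero when the transformed point lies on the line."""
    x = [c + tk for c, tk in zip(matvec(rotor_to_matrix(*rotor), y), t)]
    return [mk - ck for mk, ck in zip(m, cross(n, x))]


# Hypothetical pose: quarter turn about the third axis, translation (0, 0, 5).
rotor = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
t = (0.0, 0.0, 5.0)
# Projection ray with unit direction n through the point (0, -1, 5); moment m = n x point.
n = (0.0, 0.0, 1.0)
m = cross(n, (0.0, -1.0, 5.0))

res_on = point_line_residual(n, m, rotor, t, (1.0, 0.0, 0.0))   # lands on the ray
res_off = point_line_residual(n, m, rotor, t, (2.0, 0.0, 0.0))  # lands 1 unit away
```

For a unit direction $n_i$, the norm of the residual equals the distance of the transformed point from the ray, so `res_off` has norm 1 here.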



Let the state vector $s$ for the EKF be a 7D vector, composed in terms of the
rotor coefficients for rotation and translation,

$$s = (R^T, t^T)^T = (r_0, r_1, r_2, r_3, t_1, t_2, t_3)^T.$$

The rotation coefficients must satisfy the unit condition

$$f_2 = R^T R - 1 = r_0^2 + r_1^2 + r_2^2 + r_3^2 - 1 = 0.$$

The noise free measurement vector $a_i$ is given by the actual line parameters $n_i$
and $m_i$, and the actual 3D point measurements $y_i$,

$$a_i = (n_i^T, m_i^T, y_i^T)^T = (n_{i1}, n_{i2}, n_{i3}, m_{i1}, m_{i2}, m_{i3}, y_{i1}, y_{i2}, y_{i3})^T.$$

For a sequence of measurements $a_i$ and states $s_i$, the constraint equations

$$f_i(a_i, s_i) = \begin{pmatrix} m_i - n_i \times (\mathcal{R}_i y_i + t_i) \\ R_i^T R_i - 1 \end{pmatrix} = 0$$

relate measurements and states in a nonlinear manner. The system model in this
static case should be $s_{i+1} = s_i + \zeta_i$, where $\zeta_i$ is a vector random sequence with
known statistics, $E[\zeta_i] = 0$, $E[\zeta_i \zeta_k^T] = Q_i \delta_{ik}$, where $\delta_{ik}$ is the Kronecker delta
and the matrix $Q_i$ is assumed to be nonnegative definite. We assume that the
measurement system is disturbed by additive white noise, i.e., the real observed
measurement $a'_i$ is expressed as $a'_i = a_i + \eta_i$.

The vector $\eta_i$ is an additive, random sequence with known statistics, $E[\eta_i] = 0$,
$E[\eta_i \eta_k^T] = W_i \delta_{ik}$, where the matrix $W_i$ is assumed to be nonnegative definite.

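Since the observation equation is nonlinear (that means, the relationship
between the measurement $a'_i$ and state $s_i$ is nonlinear), we expand $f_i(a_i, s_i)$ into
a Taylor series about $(a'_i, s_{i/i-1})$, where $a'_i$ is the real measurement and $s_{i/i-1}$
is the predicted state at situation $i$. By ignoring the second order terms, we get
the linearized measurement equation

$$z_i = H_i s_i + \xi_i,$$

where

$$z_i = f_i(a'_i, s_{i/i-1}) - \frac{\partial f_i(a'_i, s_{i/i-1})}{\partial s_i}\, s_{i/i-1}
= \begin{pmatrix} m'_i - n'_i \times (\mathcal{R}_{i/i-1}\, y'_i + t_{i/i-1}) \\ R_{i/i-1}^T R_{i/i-1} - 1 \end{pmatrix} + H_i\, s_{i/i-1}.$$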

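The measurement matrix $H_i$ of the linearized measurement $z_i$ reads

$$H_i = -\frac{\partial f_i(a'_i, s_{i/i-1})}{\partial s_i} = \begin{pmatrix} C_{n'_i} D_{\mathcal{R}} & C_{n'_i} \\ D_R & 0_{1\times 3} \end{pmatrix},$$

where

$$D_R = -\frac{\partial (R_{i/i-1}^T R_{i/i-1} - 1)}{\partial R_i} = \left(-2r_{(i/i-1)0},\; -2r_{(i/i-1)1},\; -2r_{(i/i-1)2},\; -2r_{(i/i-1)3}\right),$$

$$D_{\mathcal{R}} = \frac{\partial (\mathcal{R}_{i/i-1}\, y'_i)}{\partial R_i} = \begin{pmatrix} d_1 & d_2 & d_3 & d_4 \\ d_4 & -d_3 & d_2 & -d_1 \\ -d_3 & -d_4 & d_1 & d_2 \end{pmatrix},$$

$$d_1 = 2(r_{(i/i-1)0}\, y'_{i1} + r_{(i/i-1)3}\, y'_{i2} - r_{(i/i-1)2}\, y'_{i3}),$$

$$d_2 = 2(r_{(i/i-1)1}\, y'_{i1} + r_{(i/i-1)2}\, y'_{i2} + r_{(i/i-1)3}\, y'_{i3}),$$

$$d_3 = 2(-r_{(i/i-1)2}\, y'_{i1} + r_{(i/i-1)1}\, y'_{i2} - r_{(i/i-1)0}\, y'_{i3}),$$

$$d_4 = 2(-r_{(i/i-1)3}\, y'_{i1} + r_{(i/i-1)0}\, y'_{i2} + r_{(i/i-1)1}\, y'_{i3}).$$

The $3 \times 3$ matrix $C_{n'_i}$ is the skew-symmetric matrix of $n'_i$. For any vector $y$, we
have $C_{n'_i}\, y = n'_i \times y$ with

$$C_{n'_i} = \begin{pmatrix} 0 & -n'_{i3} & n'_{i2} \\ n'_{i3} & 0 & -n'_{i1} \\ -n'_{i2} & n'_{i1} & 0 \end{pmatrix}.$$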
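The measurement noise is given by

$$\xi_i = -\frac{\partial f_i(a'_i, s_{i/i-1})}{\partial a_i}\,(a_i - a'_i)
= \begin{pmatrix} C_{x_{i/i-1}} & I_{3\times 3} & -C_{n'_i}\mathcal{R}_{i/i-1} \\ 0_{1\times 3} & 0_{1\times 3} & 0_{1\times 3} \end{pmatrix} \eta_i
= \frac{\partial f_i(a'_i, s_{i/i-1})}{\partial a_i}\, \eta_i,$$

where $I_{3\times 3}$ is a unit matrix and $C_{x_{i/i-1}}$ is the skew-symmetric matrix of $x_{i/i-1}$
with

$$x_{i/i-1} = \mathcal{R}_{i/i-1}\, y'_i + t_{i/i-1}.$$

The expectation and the covariance of the new measurement noise $\xi_i$ are easily
derived from that of $a'_i$ as

$$E[\xi_i] = 0 \quad\text{and}\quad E[\xi_i \xi_i^T] = V_i = \left(\frac{\partial f_i}{\partial a_i}\right) W_i \left(\frac{\partial f_i}{\partial a_i}\right)^T.$$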
The EKF motion estimation algorithms based on point-plane and line-plane 
constraints can be derived in a similar way. 
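Because a sign or index slip in the Jacobian silently degrades an EKF, it can be worth checking the analytic derivative of the rotated point with respect to the rotor components against finite differences. The following is a minimal sketch, assuming the rotation matrix pattern of this section; the function names and test values are hypothetical.

```python
def rotate(r, y):
    """Apply the rotation matrix built from rotor components r = (r0, r1, r2, r3) to y."""
    r0, r1, r2, r3 = r
    M = [
        [r0*r0 + r1*r1 - r2*r2 - r3*r3, 2*(r1*r2 + r0*r3), 2*(r1*r3 - r0*r2)],
        [2*(r1*r2 - r0*r3), r0*r0 - r1*r1 + r2*r2 - r3*r3, 2*(r2*r3 + r0*r1)],
        [2*(r1*r3 + r0*r2), 2*(r2*r3 - r0*r1), r0*r0 - r1*r1 - r2*r2 + r3*r3],
    ]
    return [sum(M[i][j] * y[j] for j in range(3)) for i in range(3)]


def analytic_jacobian(r, y):
    """3x4 derivative of the rotated point w.r.t. (r0, r1, r2, r3), built from d1..d4."""
    r0, r1, r2, r3 = r
    y1, y2, y3 = y
    d1 = 2 * (r0*y1 + r3*y2 - r2*y3)
    d2 = 2 * (r1*y1 + r2*y2 + r3*y3)
    d3 = 2 * (-r2*y1 + r1*y2 - r0*y3)
    d4 = 2 * (-r3*y1 + r0*y2 + r1*y3)
    return [[d1, d2, d3, d4], [d4, -d3, d2, -d1], [-d3, -d4, d1, d2]]


def numeric_jacobian(r, y, h=1e-6):
    """Central finite differences, one rotor component at a time."""
    J = [[0.0] * 4 for _ in range(3)]
    for k in range(4):
        plus, minus = list(r), list(r)
        plus[k] += h
        minus[k] -= h
        fp, fm = rotate(plus, y), rotate(minus, y)
        for i in range(3):
            J[i][k] = (fp[i] - fm[i]) / (2 * h)
    return J


r = (0.9, 0.1, -0.3, 0.2)   # arbitrary test values; unit length is not required here
y = (1.5, -2.0, 0.7)
Ja, Jn = analytic_jacobian(r, y), numeric_jacobian(r, y)
max_err = max(abs(Ja[i][k] - Jn[i][k]) for i in range(3) for k in range(4))
```

Since the rotated point is quadratic in the rotor components, central differences agree with the analytic matrix up to rounding, so `max_err` should be tiny.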



5.2 EKF Pose Estimation Based on Point-Plane Constraint 

The projection plane $P_{12}$ in the point-plane constraint equation is represented
by $(d_i, p_i)$, where $d_i$ is its Hesse distance and $p_i$ its unit direction. The
point-plane constraint equation in vector algebra of $\mathbb{R}^3$ reads

$$d_i - p_i^T(\mathcal{R} y_i + t) = 0.$$

With the measurement vector $a_i = (d_i, p_i^T, y_i^T)^T$ and the same state vector $s$
as above, the measurement $z_i$ of the linearized measurement equation reads

$$z_i = \begin{pmatrix} d'_i - p'^T_i(\mathcal{R}_{i/i-1}\, y'_i + t_{i/i-1}) \\ R_{i/i-1}^T R_{i/i-1} - 1 \end{pmatrix} + H_i\, s_{i/i-1}.$$

The measurement matrix $H_i$ of the linearized measurement $z_i$ now reads

$$H_i = \begin{pmatrix} p'^T_i D_{\mathcal{R}} & p'^T_i \\ D_R & 0_{1\times 3} \end{pmatrix}.$$

The measurement noise is given by

$$\xi_i = \begin{pmatrix} 1 & -(\mathcal{R}_{i/i-1}\, y'_i + t_{i/i-1})^T & -p'^T_i \mathcal{R}_{i/i-1} \\ 0 & 0_{1\times 3} & 0_{1\times 3} \end{pmatrix} \eta_i.$$

5.3 EKF Pose Estimation Based on Line-Plane Constraint 

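Using the line-plane constraint, the reference model entity in $\mathcal{G}^+_{3,0,1}$ is the
Plücker line $S_i = n_i + I m_i$. This line transformed by a motor $M = R + IR'$
reads

$$L_i = M S_i \tilde{M} = R n_i \tilde{R} + I(R n_i \tilde{R}' + R' n_i \tilde{R} + R m_i \tilde{R}) = u_i + I v_i.$$

We denote the 8 components of the motor as

$$M = r_0 + r_1\gamma_2\gamma_3 + r_2\gamma_3\gamma_1 + r_3\gamma_1\gamma_2 + I(r'_0 + r'_1\gamma_2\gamma_3 + r'_2\gamma_3\gamma_1 + r'_3\gamma_1\gamma_2).$$

The line motion equation can be equivalently expressed in vector form,

$$u_i = \mathcal{R} n_i \quad\text{and}\quad v_i = \mathcal{A} n_i + \mathcal{R} m_i,$$

with

$$\mathcal{A} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix},$$

$$a_{11} = 2(r'_0 r_0 + r'_1 r_1 - r'_2 r_2 - r'_3 r_3), \quad a_{12} = 2(r'_3 r_0 + r'_2 r_1 + r'_1 r_2 + r'_0 r_3),$$

$$a_{13} = 2(-r'_2 r_0 + r'_3 r_1 - r'_0 r_2 + r'_1 r_3), \quad a_{21} = 2(-r'_3 r_0 + r'_2 r_1 + r'_1 r_2 - r'_0 r_3),$$

$$a_{22} = 2(r'_0 r_0 - r'_1 r_1 + r'_2 r_2 - r'_3 r_3), \quad a_{23} = 2(r'_1 r_0 + r'_0 r_1 + r'_3 r_2 + r'_2 r_3),$$

$$a_{31} = 2(r'_2 r_0 + r'_3 r_1 + r'_0 r_2 + r'_1 r_3), \quad a_{32} = 2(-r'_1 r_0 - r'_0 r_1 + r'_3 r_2 + r'_2 r_3),$$

$$a_{33} = 2(r'_0 r_0 - r'_1 r_1 - r'_2 r_2 + r'_3 r_3).$$

The line-plane constraint equation in vector algebra of $\mathbb{R}^3$ reads

$$\begin{pmatrix} f_1 \\ f_2 \end{pmatrix} = \begin{pmatrix} p_i^T u_i \\ d_i u_i + v_i \times p_i \end{pmatrix}
= \begin{pmatrix} p_i^T(\mathcal{R} n_i) \\ d_i \mathcal{R} n_i + (\mathcal{A} n_i + \mathcal{R} m_i) \times p_i \end{pmatrix} = 0.$$

We use the 8 components of the motor as the state vector for the EKF,

$$s = (r_0, r_1, r_2, r_3, r'_0, r'_1, r'_2, r'_3)^T,$$

and these 8 components must satisfy both the unit and orthogonal conditions:

$$f_3 = r_0^2 + r_1^2 + r_2^2 + r_3^2 - 1 = 0,$$

$$f_4 = r_0 r'_0 + r_1 r'_1 + r_2 r'_2 + r_3 r'_3 = 0.$$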
The 10D noise free measurement vector $a_i$ is given by the true plane parameters
$d_i$ and $p_i$, and the true 6D line parameters $(n_i, m_i)$,

$$a_i = (d_i, p_i^T, n_i^T, m_i^T)^T.$$

The new measurement in the linearized equation reads

$$z_i = \begin{pmatrix}
p'^T_i(\mathcal{R}_{i/i-1}\, n'_i) \\
d'_i \mathcal{R}_{i/i-1}\, n'_i + (\mathcal{A}_{i/i-1}\, n'_i + \mathcal{R}_{i/i-1}\, m'_i) \times p'_i \\
R_{i/i-1}^T R_{i/i-1} - 1 \\
R_{i/i-1}^T R'_{i/i-1}
\end{pmatrix} + H_i\, s_{i/i-1}.$$

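The measurement matrix $H_i$ of the linearized measurement $z_i$ reads, with all
quantities evaluated at the predicted state $s_{i/i-1}$ and the observed measurement $a'_i$,

$$H_i = -\frac{\partial f_i}{\partial s_i} = \begin{pmatrix}
-p'^T_i \dfrac{\partial(\mathcal{R} n'_i)}{\partial R_i} & 0_{1\times 4} \\[2ex]
-d'_i \dfrac{\partial(\mathcal{R} n'_i)}{\partial R_i} + C_{p'_i}\left(\dfrac{\partial(\mathcal{A} n'_i)}{\partial R_i} + \dfrac{\partial(\mathcal{R} m'_i)}{\partial R_i}\right) & C_{p'_i}\dfrac{\partial(\mathcal{A} n'_i)}{\partial R'_i} \\[2ex]
-\dfrac{\partial(R^T R - 1)}{\partial R_i} & 0_{1\times 4} \\[2ex]
-\dfrac{\partial(R^T R')}{\partial R_i} & -\dfrac{\partial(R^T R')}{\partial R'_i}
\end{pmatrix},$$

where $\partial/\partial R_i$ and $\partial/\partial R'_i$ denote the derivatives with respect to the four
components of the rotors $R$ and $R'$, respectively.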

The $3 \times 3$ matrix $C_{p'_i}$ is the skew-symmetric matrix of $p'_i$. The measurement
noise is given by

$$\xi_i = \begin{pmatrix}
0 & n'^T_i \mathcal{R}_{i/i-1}^T & p'^T_i \mathcal{R}_{i/i-1} & 0_{1\times 3} \\
\mathcal{R}_{i/i-1}\, n'_i & C_{v_i} & d'_i \mathcal{R}_{i/i-1} - C_{p'_i}\mathcal{A}_{i/i-1} & -C_{p'_i}\mathcal{R}_{i/i-1} \\
0_{2\times 1} & 0_{2\times 3} & 0_{2\times 3} & 0_{2\times 3}
\end{pmatrix} \eta_i,$$

where $C_{v_i}$ is the skew-symmetric matrix of $v_i$, and $v_i$ is defined as

$$v_i = \mathcal{A}_{i/i-1}\, n'_i + \mathcal{R}_{i/i-1}\, m'_i.$$

Having linearized the measurement models, the EKF implementation is straightforward
and standard. Further implementation details will not be repeated here.
In the next section, we will denote the EKF as RtEKF, if the state explicitly
uses the rotor components of rotation $R$ and of translation $t$, or MEKF, if
the components of motor $M$ are used.



6 Experiments 

In this section we present some experiments with real images. The aim of the 
experiments is to study the performance of the algorithms for pose estimation 
based on geometric constraints. We expect that both the special constraints and 
the algorithmic approach of using them may influence the results. 
Fig. 5. The scenario of the experiment: The calibration of an object model is performed
and the 3D object model is projected on the image. Then the camera is moved and
corresponding line segments are extracted.



Table 2. The experiment 1 results in different qualities of derived motion parameters, 
depending on the used constraints and algorithms to evaluate their validity. 




In our experimental scenario we positioned a camera two meters in front of 
a calibration cube. We focused the camera on the calibration cube and took an 
image. Then we moved the camera, focused the camera again on the cube and 
took another image. The edge size of the calibration cube is 46 cm and the image 
size is 384 x 288 pixel. Furthermore, we defined on the calibration cube a 3D 
object model. Figure 5 shows the scenario. In the left images the calibration is
performed and the 3D object model is projected on the image. Then the camera 
is moved and corresponding line segments are extracted. In these experiments 
we actually selected certain points by hand and from these the depicted line 
segments are derived and, by knowing the camera calibration by the cube of 
the first image, the actual projection ray and projection plane parameters are 
computed. 

For the first experiment we show in table 2 the results of different algorithms
for pose estimation. In the second column of table 2, EKF denotes the use of the
EKFs derived in section 5, MAT denotes matrix algebra, and SVD denotes the
singular value decomposition of a matrix. In the third column the constraints used,
point-line (XL), point-plane (XP) and line-plane (LP), are indicated. The fourth
column shows the estimated rotation matrix $\mathcal{R}$ and the translation
vector $t$, respectively. The translation vectors are given in mm. The fifth
column shows the error of the equation system. Since the error of the equation
system describes the Hesse distance of the entities, the value of the error is an
approximation of the squared average distance of the entities. The results
obtained with the different approaches are all very close to each
other, although the implementations lead to totally different calculations and
algorithms. Furthermore, the EKFs perform more stably than the matrix solution
approaches.

Some of the errors are visualized in figure 6. We calculated the motion
of the object and projected the transformed object onto the image plane. The
extracted line segments are overlaid in addition. Figure 6 shows the results of
nos. 5, 3, 7 and 8 of table 2, respectively.




Fig. 6. Visualization of some errors. The results of nos. 5, 3, 7 and 8 of table 2 are
visualized respectively.



In a second experiment we compare the noise sensitivity of the Kalman filters 
and of the matrix solution approaches for pose estimation. The experiment is 
organized as follows. We took the point correspondences of the first experiment 
and estimated both $\mathcal{R}$ and $t$. Then we added Gaussian noise to the
extracted image points. The noise varied from 0 to 16 pixels in steps of 0.25, and
we estimated $\mathcal{R}'$ and $t'$ for each step. Then we calculated the error between
$\mathcal{R}'$ and $\mathcal{R}$ and between $t'$ and $t$. The results are shown in figure 7. Since $\mathcal{R}$
and $\mathcal{R}'$ are rotation matrices, the absolute value of the rotation error lies between
0 and 1. The error of the translation vector is evaluated in mm. With the matrix
solution approach the error of the translation vector lies in the range of about
$0 < e_t < 10$ cm, while with the Kalman filter the corresponding range is
$0 < e_t < 6$ cm. The matrix based solutions all look very similar. Compared with the
EKF results they are very sensitive to noise, and the variances between the noise
steps are very high. The results are in agreement with the well known behavior of
error propagation in case of matrix based rotation estimation. The EKF based
solutions all perform very stably, and the behavior of the different constraints
is also very similar. This is a consequence of the estimators themselves and of
the fact that the concatenation of rotors is more robust than that of rotation
matrices. Clearly, the results of these experiments are affected by
the method used to obtain the entities in the image. In this experiment we selected
certain points directly by hand and derived from these the line subspaces. So
the quality of the line subspaces is directly connected to the quality of the point
extraction. For comparison purposes between the algorithms this is necessary
and reasonable. But for real applications, since the extraction of lines is more
stable than that of points, the XP or LP algorithms should be preferred.





Fig. 7. Performance comparison of different methods in case of noisy data. With in- 
creasing noise the EKF performs with more accurate and more stable estimates than 
the matrix based methods. 



7 Conclusions 

In this paper we describe a framework for 2D-3D pose estimation. The aim of 
the paper is to compare several pose modeling approaches and estimation meth- 
ods with respect to their performance. The main contribution of the paper is to 
formulate 2D-3D pose determination in the language of kinematics as a prob- 
lem of estimating rotation and translation from geometric constraint equations. 
There are three such constraints which relate the model frame to an observation 
frame. The model data are either points or lines. The observation frame is
constituted by lines or planes. Any deviations from the constraint correspond to the
Hesse distance of the involved geometric entities. From this starting point as a
useful algebraic frame for handling line motion, the motor algebra has been
introduced. The estimation procedure is realized as extended Kalman filters (EKF).
The paper presents EKFs for estimating rotation and translation for each
constraint model in different algebraic frames. The experiments show advantages of
that representation and of the EKF approaches in comparison to normal matrix
based LMS algorithms, all applied within the context of constraint based pose
estimation.
Abstract. We present quantitative results for computing local least
squares and global regularized range flow using both image and range 
data. We first review the computation of local least squares range flow 
and then show how its computation can be cast in a global Horn and 
Schunck like regularization framework unj. These computations are 
done using both range data only and using a combination of image 
and range data 1141 . We present quantitative results for these two least 
squares range flow algorithms and for the two regularization range 
flow algorithms for one synthetic range sequence and one real range 
sequence, where the correct 3D motions are known a priori. We show 
that using both image and range data produces more accurate and more 
dense range flow than the use of range flow alone. 



1 Introduction 

We can use image sequences to compute optical flow in a local least squares 
calculation, for example, Lucas and Kanade |S], or in a global iterative regu- 
larization, for example, Horn and Schunck |^. In addition to the use of image 
intensity data, it is possible to use densely sampled range sequences m to com- 
pute range flow. Range data (for example from a Biris range sensor |^) consists 
of 2D arrays of the 3D coordinates (in millimeters) of a scene, i.e. the 3D X, 
Y and Z values, plus the grayvalue intensity at those same points. Since our 
range sensor acquires images under orthographic projection we can only com- 
pute image flow (orthographic optical flow) rather than perspective optical 
flow, although the same algorithms can be used in both cases. Just as opti- 
cal/image flow can be computed from time- varying image data PP, range flow 
can be computed from time varying range data HZ|. The Biris range sensor is 
based on active triangularization using a laser beam and on a dual aperture 
mask. It has a reported depth accuracy of about 0.1mm for objects at a dis- 
tance of 250mm |2|. This paper investigates the computation of range flow on 
one synthetic range sequence and on one real range sequence made with a Biris 
range sensor using regularization on both range and/or intensity derivatives. We 

R. Klette et al. (Eds.): Multi-Image Analysis, LNCS 2032, pp. 171-^^^ 2001. 

@ Springer- Verlag Berlin Heidelberg 2001 
also show how local and global optical flow computations can be extended into
3D, allowing the calculation of dense accurate range flow fields, often when the 
number of individual range velocities is sparse. 

Although the work reported here was performed with Biris range sensor data, 
there is no reason why our algorithms could not be used with other sources of 
time-varying depth information, such as depth maps from stereo m or motion 
and structure algorithms. Here we assume locally rigid objects (although 
both of our sequences have globally rigid objects). Instead of computing camera 
motion parameters and overall scene motion, we are interested in computing 
the range flow, i.e. the 3D velocity, at each point the depth data is sampled at. 
Towards this end, we start with the range constraint equation of mmi- 

2D optical flow methods have recently been generalized into the 3D domain. 
Chaudhury et al. P] formulated a 3D optical flow constraint, using $I_x$, $I_y$, $I_z$
and $I_t$ derivatives. Thus they have a time-varying volume of intensity where all 4
derivatives can be computed. A lot of this work has been medically motivated, for 
example, to compute 3D flow for CT, MRI and PET datasets [1 Ill2llbl7j . Since 
range flow is computed with respect to a moving 3D surface, derivative data in 
the Z dimension is not available, resulting in different constraint equations for 
range data and for 3D optical flow. 

The basic algorithms used in this paper have been reported in more detail 
elsewhere: 

1. Quantitative flow analysis using the Lucas and Kanade least squares calcu- 
lation and the Horn and Schunck regularization were reported in P|. 

2. The computation of full range flow (and its two types of normal flow) in 
a total least squares framework was reported in ini-. Here, the range flow
calculation is reformulated in a least squares framework P).

3. The direct regularization was presented in HB| for a number of different 
sequences, including a real sequence made from the 3D motion of a growing 
castor oil bean leaf using a Biris range sensor.

4. The computation of range flow from both intensity and range data in both a 
total least squares framework (as opposed to a least squares framework used 
here) and a regularization framework is reported in pi 4] . 

We examine the quantitative performance of these algorithms on two
intensity/range sequences: a synthetic sequence where the range and intensity data
are error-free, yielding good flows, and a real sequence where both the range
and intensity data are poor. In the latter case, we also know the true 3D velocity
and are thus able to quantitatively analyze the flow. The results are quite good
when the combined intensity and range data are taken into account, especially 
when one considers that the range structure is very poor at most locations (the 
surfaces are planar). 

2 2D Image Flow 

The well known motion constraint equation:

$$I_x u + I_y v + I_t = 0 \qquad (1)$$

forms the basis of most optical flow algorithms. $I_x$, $I_y$ and $I_t$ in equation (1)
are the $x$, $y$ and $t$ intensity derivatives while $v = (u, v)$ is the image velocity
(or optical flow) at pixel $(x, y)$, which is an approximation of the local image
motion. Equation (1) is 1 equation in 2 unknowns and manifests the aperture
problem. Raw normal velocity (the component of image velocity normal to the
local intensity structure) can be totally expressed in terms of derivative
information:

$$v_n = \frac{-I_t\,(I_x, I_y)}{I_x^2 + I_y^2}, \qquad (2)$$

while tangential velocity, $v_t$, cannot, in general, be recovered.

To solve for $v$ we need to impose an additional constraint. An example of a
local constraint is to assume that locally all image velocities are the same. For
example, Lucas and Kanade |S| use a least squares computation to integrate local
neighbourhoods of normal image velocities into full image velocities. For an $n \times n$
neighbourhood, they solve an $n^2 \times 2$ linear system of equations $A_{n^2 \times 2}\, v = B_{n^2 \times 1}$ as

$$v = (A^T A)^{-1} A^T B, \qquad (3)$$

where $A$ has entries $I_{xi}$ and $I_{yi}$ in the $i$-th row and $B$ has entries $-I_{ti}$ in the $i$-th
row. We perform eigenvector/eigenvalue analysis on $A^T A$ using routines in |0|.
Eigenvalue ($\lambda_0 \le \lambda_1$) and corresponding eigenvector ($e_0$ and $e_1$) decomposition
of the symmetric matrix $A^T A$ yields least squares full image velocity, if both
$\lambda_0 > \tau_{D1}$ and $\lambda_1 > \tau_{D1}$, or least squares normal image velocity, $v_{ln} = v \cdot e_1$,
if $\lambda_1 > \tau_{D1}$ but $\lambda_0 \le \tau_{D1}$. On the other hand, Horn and Schunck jS| impose a
global smoothness constraint on the optical flow field and minimize:

$$\int\!\!\int (I_x u + I_y v + I_t)^2 + \alpha^2 (u_x^2 + u_y^2 + v_x^2 + v_y^2)\, dx\, dy. \qquad (4)$$

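We can minimize this functional using Euler-Lagrange equations (with $\nabla^2 u$ and
$\nabla^2 v$ approximated as $\bar{u} - u$ and $\bar{v} - v$ respectively) as:

$$\begin{pmatrix} \alpha^2 + I_x^2 & I_x I_y \\ I_x I_y & \alpha^2 + I_y^2 \end{pmatrix}
\begin{pmatrix} u \\ v \end{pmatrix} =
\begin{pmatrix} \alpha^2 \bar{u} - I_x I_t \\ \alpha^2 \bar{v} - I_y I_t \end{pmatrix}, \qquad (5)$$

yielding the Gauss Seidel iterative equations:

$$\begin{pmatrix} u^{n+1} \\ v^{n+1} \end{pmatrix} = A^{-1}
\begin{pmatrix} \alpha^2 \bar{u}^n - I_x I_t \\ \alpha^2 \bar{v}^n - I_y I_t \end{pmatrix}, \qquad (6)$$

where $A$ here denotes the $2 \times 2$ matrix on the left-hand side of equation (5).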
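The local least squares step with its eigenvalue test can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name, the return convention and the default threshold value are assumptions, and the normal velocity is returned as the vector $(v \cdot e_1) e_1$.

```python
import math


def lucas_kanade(Ix, Iy, It, tau=1.0):
    """Least squares image velocity for one neighbourhood.

    Ix, Iy, It are lists of derivatives at the neighbourhood's pixels.
    Returns ("full", (u, v)), ("normal", (u, v)) or ("none", None),
    depending on how the eigenvalues of A^T A compare with tau.
    """
    # Entries of the symmetric 2x2 matrix A^T A and of A^T B (B has entries -It).
    a = sum(ix * ix for ix in Ix)
    b = sum(ix * iy for ix, iy in zip(Ix, Iy))
    c = sum(iy * iy for iy in Iy)
    bx = -sum(ix * it for ix, it in zip(Ix, It))
    by = -sum(iy * it for iy, it in zip(Iy, It))

    # Closed-form eigenvalues of [[a, b], [b, c]], lam0 <= lam1.
    mean = 0.5 * (a + c)
    dev = math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    lam0, lam1 = mean - dev, mean + dev

    if lam0 > tau:                       # both eigenvalues large: full velocity
        det = a * c - b * b
        return "full", ((c * bx - b * by) / det, (a * by - b * bx) / det)
    if lam1 > tau:                       # only one large: velocity along e1
        ex, ey = lam1 - c, b             # eigenvector for lam1
        if abs(ex) + abs(ey) < 1e-12:    # degenerate form, use the other one
            ex, ey = b, lam1 - a
        norm = math.hypot(ex, ey)
        ex, ey = ex / norm, ey / norm
        speed = (ex * bx + ey * by) / lam1
        return "normal", (speed * ex, speed * ey)
    return "none", None


# Hypothetical derivatives generated from a known flow (1.0, -0.5): It = -(Ix*u + Iy*v).
full_kind, full_flow = lucas_kanade([1, 0, 2, 1], [0, 1, 1, -1], [-1.0, 0.5, -1.5, -1.5])
# Gradients in one direction only (aperture problem): only normal velocity is recoverable.
normal_kind, normal_flow = lucas_kanade([1, 2, 3], [0, 0, 0], [-1.0, -2.0, -3.0])
```

In the second call the true flow is again (1.0, -0.5), but only the component along the gradient direction, (1.0, 0.0), can be recovered.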



3 3D Range Flow 

Biris range data consists not only of 3D coordinate $(X, Y, Z)$ data of an
environmental scene but also intensity data for each of those environmental points.
The motion constraint equation can easily be extended into the range constraint
equation jT^ in 3D:

$$Z_X U + Z_Y V + W + Z_t = 0, \qquad (7)$$

where $V = (U, V, W)$ is the 3D range velocity and $Z_X$, $Z_Y$ and $Z_t$ are
spatio-temporal derivatives of the depth coordinate $Z$. Raw normal velocity can also
be computed directly from $Z$ derivatives as

$$V_n = \frac{-Z_t\,(Z_X, Z_Y, 1)}{Z_X^2 + Z_Y^2 + 1}. \qquad (8)$$

For an $n \times n$ neighbourhood, we can solve an $n^2 \times 3$ linear system of equations
$A_{n^2 \times 3}\, V = B_{n^2 \times 1}$ as

$$V = (A^T A)^{-1} A^T B, \qquad (9)$$

where $A$ has entries $Z_{Xi}$, $Z_{Yi}$ and 1 in the $i$-th row and $B$ has entries $-Z_{ti}$ in
the $i$-th row. Alternatively to this least squares computation a total least squares
approach may be used M-. The eigenvalues ($\lambda_0 \le \lambda_1 \le \lambda_2$) and their
corresponding eigenvectors ($e_0$, $e_1$ and $e_2$) can be computed from the $3 \times 3$ symmetric
matrix $A^T A$ and then used to compute least squares full range velocity, $V_F$,
when $\lambda_0, \lambda_1, \lambda_2 > \tau_{D2}$, an estimate of least squares line normal velocity, $V_L$,
when $\lambda_1, \lambda_2 > \tau_{D2}$, $\lambda_0 < \tau_{D2}$, and an estimate of the least squares plane normal
velocity, $V_P$, when $\lambda_2 > \tau_{D2}$, $\lambda_0, \lambda_1 < \tau_{D2}$. The terms line and plane normal
range velocity are motivated by the fact that these types of normal velocity
always occur on lines or planes on the 3D surface. That is:

$$V_F = (V \cdot e_0)e_0 + (V \cdot e_1)e_1 + (V \cdot e_2)e_2 \qquad (10)$$

$$V_L = (V \cdot e_1)e_1 + (V \cdot e_2)e_2 \qquad (11)$$

$$V_P = (V \cdot e_2)e_2. \qquad (12)$$

Of course $V$ is $V_F$. This computational scheme breaks down if $A^T A$ cannot be
reliably inverted, as then we cannot compute $V$ as required in equations (10)
to (12). Below we outline how to compute line and planar normal flow when
$A^T A$ is nearly singular. We can rewrite the eigenvalue/eigenvector equation,
$A^T A e_i = \lambda_i e_i$, as

$$A^T A\,[e_0, e_1, e_2] = A^T A R = R \begin{pmatrix} \lambda_0 & 0 & 0 \\ 0 & \lambda_1 & 0 \\ 0 & 0 & \lambda_2 \end{pmatrix}, \qquad (13)$$

where $R = [e_0, e_1, e_2]$ has the eigenvectors as its columns. Thus we can rewrite (9) as:

$$\begin{pmatrix} \lambda_0 & 0 & 0 \\ 0 & \lambda_1 & 0 \\ 0 & 0 & \lambda_2 \end{pmatrix} V' = B', \qquad (14)$$

where $V' = (U', V', W')^T = R^T V$ and $B' = (b'_0, b'_1, b'_2)^T = R^T A^T B$. If $\lambda_0$ is
small, $\lambda_0 < \tau_{D2}$, $\lambda_1, \lambda_2 > \tau_{D2}$, we are dealing with a line normal velocity,
$V_L = (U_L, V_L, W_L)$. Then the 2nd and 3rd equations of (14) give two equations
that define constraint planes that the normal velocity must lie in. The line normal
is given by the point on their intersecting line with minimal distance from the
origin. The direction of this line is given by $e_0 = e_1 \times e_2$, which yields a third
equation. The system of equations to be solved is:

$$V'_L = e_{10} U_L + e_{11} V_L + e_{12} W_L = \frac{b'_1}{\lambda_1} \qquad (15)$$

$$W'_L = e_{20} U_L + e_{21} V_L + e_{22} W_L = \frac{b'_2}{\lambda_2} \qquad (16)$$

$$e_{00} U_L + e_{01} V_L + e_{02} W_L = 0. \qquad (17)$$


If both $\lambda_0$ and $\lambda_1$ are less than $\tau_{D2}$ then we can only compute planar normal
range flow. In this case, we have one constraint:

$$e_{20} U_P + e_{21} V_P + e_{22} W_P = \frac{b'_2}{\lambda_2}. \qquad (18)$$

The plane normal flow is the point on this plane with minimal distance from the
origin:

$$V_P = \frac{b'_2 / \lambda_2}{e_{20}^2 + e_{21}^2 + e_{22}^2} \begin{pmatrix} e_{20} \\ e_{21} \\ e_{22} \end{pmatrix}
= \frac{b'_2}{\lambda_2} \begin{pmatrix} e_{20} \\ e_{21} \\ e_{22} \end{pmatrix}. \qquad (19)$$

Since $A^T A$ is a real, positive semi-definite, symmetric matrix,
eigenvalue/eigenvector decomposition always yields real non-negative eigenvalues.
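On a planar patch all rows of $A$ coincide, so $A^T A$ has rank one and only plane normal flow is available; in that case equation (19) should reproduce the raw normal velocity of equation (8). The sketch below checks this numerically. It is a hedged illustration: the function names are hypothetical, and the dominant eigenpair is obtained by plain power iteration (an assumption made to keep the example dependency-free, not a method named in the text).

```python
import math


def normal_equations(ZX, ZY, Zt):
    """Build A^T A (3x3) and A^T B (3-vector) from per-pixel depth derivatives."""
    rows = [(zx, zy, 1.0) for zx, zy in zip(ZX, ZY)]
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    AtB = [sum(-zt * r[i] for r, zt in zip(rows, Zt)) for i in range(3)]
    return AtA, AtB


def dominant_eigenpair(M, iters=200):
    """Largest eigenvalue/eigenvector of a symmetric PSD 3x3 matrix by power iteration."""
    v = [1.0, 1.0, 1.0]  # start vector; must not be orthogonal to the eigenvector
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            raise ValueError("start vector is orthogonal to the dominant eigenvector")
        v = [c / norm for c in w]
    lam = sum(v[i] * sum(M[i][j] * v[j] for j in range(3)) for i in range(3))
    return lam, v


def plane_normal_flow(ZX, ZY, Zt):
    """Equation (19): V_P = (b'_2 / lambda_2) e_2 with b'_2 = e_2 . (A^T B)."""
    AtA, AtB = normal_equations(ZX, ZY, Zt)
    lam2, e2 = dominant_eigenpair(AtA)
    b2 = sum(e * b for e, b in zip(e2, AtB))
    return [b2 / lam2 * e for e in e2]


def raw_normal_velocity(zx, zy, zt):
    """Equation (8)."""
    denom = zx * zx + zy * zy + 1.0
    return [-zt * zx / denom, -zt * zy / denom, -zt / denom]


# Hypothetical planar 3x3 patch: identical derivatives at all nine pixels.
ZX, ZY, Zt = [0.5] * 9, [-0.25] * 9, [0.3] * 9
VP = plane_normal_flow(ZX, ZY, Zt)
Vn = raw_normal_velocity(0.5, -0.25, 0.3)
```

Both routes give the same vector here, which is a convenient sanity check before adding the line normal and full flow cases.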



4 Least Squares Image-Range Flow 

We note that if we compute derivatives of intensity with respect to $X$ and $Y$,
rather than $x$ and $y$ (the projection of $X$ and $Y$ on the sensor grid), the motion
constraint equation becomes:

$$I_X U + I_Y V + I_t = 0, \qquad (20)$$

where $U$ and $V$ are the first two components of range flow. Since a Biris sensor's
images are made under orthographic projection we use standard optical flow as
image flow. $(U, V)$ can then be recovered by a least squares calculation. They
are the first two components of range flow and are orthographic image velocity
(which we call image flow). If we use equations of the form in (7) and (20)
whenever intensity and/or depth derivatives are reliably available, we obtain a least
squares linear system of equations for $U$, $V$ and $W$ in terms of the
spatio-temporal intensity and depth derivatives. We require that at least one equation of
the form in equation (7) be present to constrain the $W$ parameter. We use $\beta$ to
weigh the contribution of the depth and intensity derivatives in the computation
of $(U, V, W)$ so that they are of equal influence. We solve for $(U, V, W)$ using least
squares as outlined above, checking the eigenvalues against a third threshold,
$\tau_{D3}$.
5 Direct Regularized Range Flow

We can compute regularized range flow directly using the spatio-temporal
derivatives of $Z$ by minimizing

$$\int\!\!\int\!\!\int\!\!\int (Z_X U + Z_Y V + W + Z_t)^2 + \alpha^2 (U_X^2 + U_Y^2 + U_Z^2 + V_X^2 + V_Y^2 + V_Z^2 + W_X^2 + W_Y^2 + W_Z^2)\, dX\, dY\, dZ\, dt. \qquad (21)$$

We can write the Euler-Lagrange equations using the approximations
$\nabla^2 U = U_{XX} + U_{YY} + U_{ZZ} \approx \bar{U} - U$, $\nabla^2 V = V_{XX} + V_{YY} + V_{ZZ} \approx \bar{V} - V$ and
$\nabla^2 W = W_{XX} + W_{YY} + W_{ZZ} \approx \bar{W} - W$ respectively as:

$$\begin{pmatrix} \alpha^2 + Z_X^2 & Z_X Z_Y & Z_X \\ Z_X Z_Y & \alpha^2 + Z_Y^2 & Z_Y \\ Z_X & Z_Y & \alpha^2 + 1 \end{pmatrix}
\begin{pmatrix} U \\ V \\ W \end{pmatrix} =
\begin{pmatrix} \alpha^2 \bar{U} - Z_X Z_t \\ \alpha^2 \bar{V} - Z_Y Z_t \\ \alpha^2 \bar{W} - Z_t \end{pmatrix}. \qquad (22)$$

The Gauss Seidel equations then become:

$$\begin{pmatrix} U^{n+1} \\ V^{n+1} \\ W^{n+1} \end{pmatrix} = A^{-1}
\begin{pmatrix} \alpha^2 \bar{U}^n - Z_X Z_t \\ \alpha^2 \bar{V}^n - Z_Y Z_t \\ \alpha^2 \bar{W}^n - Z_t \end{pmatrix}, \qquad (23)$$

where $A$ denotes the $3 \times 3$ matrix on the left-hand side of equation (22).
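For a single pixel, the update in equation (23) has a closed form, because the system matrix is $\alpha^2$ times the identity plus the rank-one term $g g^T$ with $g = (Z_X, Z_Y, 1)$; the Sherman-Morrison formula then gives $V^{n+1} = \bar{V}^n - g\,(g \cdot \bar{V}^n + Z_t)/(\alpha^2 + g \cdot g)$. The sketch below applies this and verifies it against the linear system; the function name and numerical values are assumptions for illustration only.

```python
def range_flow_update(V_bar, ZX, ZY, Zt, alpha):
    """One per-pixel update of (U, V, W) from the neighbourhood average V_bar.

    Uses the closed form of A^{-1} rhs for A = alpha^2 I + g g^T, g = (ZX, ZY, 1).
    """
    g = (ZX, ZY, 1.0)
    residual = sum(gk * vk for gk, vk in zip(g, V_bar)) + Zt   # range constraint at V_bar
    scale = residual / (alpha * alpha + sum(gk * gk for gk in g))
    return [vk - scale * gk for vk, gk in zip(V_bar, g)]


# Hypothetical values for one pixel.
alpha, ZX, ZY, Zt = 2.0, 0.5, -0.25, 0.1
V_bar = [0.3, -0.2, 0.5]
V_new = range_flow_update(V_bar, ZX, ZY, Zt, alpha)

# Check against the 3x3 system of the Euler-Lagrange equations: A V_new = rhs.
A = [
    [alpha**2 + ZX * ZX, ZX * ZY, ZX],
    [ZX * ZY, alpha**2 + ZY * ZY, ZY],
    [ZX, ZY, alpha**2 + 1.0],
]
lhs = [sum(A[i][j] * V_new[j] for j in range(3)) for i in range(3)]
rhs = [alpha**2 * V_bar[0] - ZX * Zt, alpha**2 * V_bar[1] - ZY * Zt, alpha**2 * V_bar[2] - Zt]
```

A useful property of this form is that an average $\bar{V}^n$ already satisfying the range constraint is left unchanged by the update.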



6 Combined Range Flow from Intensity and Range 
Derivatives 

It is possible to compute $V$ using both intensity and range derivatives via
equations (7) and (20) and the same smoothness term given in equation (21). We
regularize:

$$\int\!\!\int\!\!\int\!\!\int (Z_X U + Z_Y V + W + Z_t)^2 + \beta^2 (I_X U + I_Y V + I_t)^2 + \alpha^2 (U_X^2 + U_Y^2 + U_Z^2 + V_X^2 + V_Y^2 + V_Z^2 + W_X^2 + W_Y^2 + W_Z^2)\, dX\, dY\, dZ\, dt. \qquad (24)$$

The Euler-Lagrange equations are

$$A \begin{pmatrix} U \\ V \\ W \end{pmatrix} =
\begin{pmatrix} \alpha^2 \bar{U} - Z_X Z_t - \beta^2 I_X I_t \\ \alpha^2 \bar{V} - Z_Y Z_t - \beta^2 I_Y I_t \\ \alpha^2 \bar{W} - Z_t \end{pmatrix}, \qquad (25)$$

where $A$ is

$$A = \begin{pmatrix}
Z_X^2 + \beta^2 I_X^2 + \alpha^2 & Z_X Z_Y + \beta^2 I_X I_Y & Z_X \\
Z_X Z_Y + \beta^2 I_X I_Y & Z_Y^2 + \beta^2 I_Y^2 + \alpha^2 & Z_Y \\
Z_X & Z_Y & 1 + \alpha^2
\end{pmatrix}. \qquad (26)$$

The Gauss Seidel equations are then

$$\begin{pmatrix} U^{n+1} \\ V^{n+1} \\ W^{n+1} \end{pmatrix} = A^{-1}
\begin{pmatrix} \alpha^2 \bar{U}^n - Z_X Z_t - \beta^2 I_X I_t \\ \alpha^2 \bar{V}^n - Z_Y Z_t - \beta^2 I_Y I_t \\ \alpha^2 \bar{W}^n - Z_t \end{pmatrix}. \qquad (27)$$

The matrix $A^{-1}$ only has to be computed once in equations (23) and (27);
existence of the inverse is guaranteed by the Sherman-Morrison-Woodbury formula
m

7 Differentiation 

The use of a good differential kernel is essential to the accuracy of both image
and range flow calculations. We use the balanced/matched filters for prefiltering
and differentiation proposed by Simoncelli. A simple averaging filter 5 , |]
was used to slightly blur the images before prefiltering/differentiation. The
prefiltering kernel's coefficients were (0.0356976, 0.2488746, 0.4308557, 0.2488746
and 0.0356976) while the differential kernel's coefficients were (−0.107663,
−0.282671, 0.0, 0.282671 and 0.107663). For example, to compute $I_x$ we first
convolve the prefiltering kernel in the $t$ dimension, then convolve the prefiltering
kernel on that result in the $y$ dimension and finally convolve the differentiation
kernel in the $x$ dimension on that result. We assume a uniform sampling of the
$Z$ data in $X$ and $Y$; in general, this is not true (but is true for our data).
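The separable scheme can be sketched on a single 2D frame as follows. This is a simplified illustration under stated assumptions: the $t$ dimension and the initial blur are omitted, the kernels are applied as a correlation over the valid region only (so that a positive ramp gives a positive derivative), and all function names are hypothetical.

```python
PREFILTER = (0.0356976, 0.2488746, 0.4308557, 0.2488746, 0.0356976)
DERIVATIVE = (-0.107663, -0.282671, 0.0, 0.282671, 0.107663)


def correlate_valid(signal, kernel):
    """1D correlation of signal with kernel, keeping only fully overlapped samples."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def derivative_x(image):
    """image[y][x]: prefilter along y, then differentiate along x."""
    width = len(image[0])
    # Prefilter every column (the y dimension).
    columns = [correlate_valid([row[x] for row in image], PREFILTER) for x in range(width)]
    smoothed = [[columns[x][y] for x in range(width)] for y in range(len(columns[0]))]
    # Differentiate every row (the x dimension).
    return [correlate_valid(row, DERIVATIVE) for row in smoothed]


# Hypothetical 9x9 test image: a linear ramp with slope 2 in x and 3 in y.
image = [[2.0 * x + 3.0 * y for x in range(9)] for y in range(9)]
Ix = derivative_x(image)   # 5x5 result; each value is close to the x slope
```

With these coefficients the derivative kernel has a gain of about 0.996 on a unit ramp, so the estimate of the slope 2 comes out as roughly 1.992 rather than exactly 2.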



8 Error Measurement 



We report 2D error for image flow and 3D error for range flow using relative
magnitude error (as a percentage) and angle error (in degrees). $V_c$ is the correct
image/range flow and $V_e$ is the estimated or computed image/range flow in the
equations below. For magnitude error we report:

$$\psi_M = \frac{\big|\,\|V_c\|_2 - \|V_e\|_2\,\big|}{\|V_c\|_2} \times 100\%, \qquad (28)$$

while for angle error we report:

$$\psi_A = \arccos(\hat{V}_c \cdot \hat{V}_e). \qquad (29)$$

For line normal range flow we compute an estimated correct line flow as:

$$(V_c \cdot \hat{e}_1)\hat{e}_1 + (V_c \cdot \hat{e}_2)\hat{e}_2. \qquad (30)$$

Of course $\hat{e}_1$ and $\hat{e}_2$ have error in themselves as they are computed from the
least squares integration matrix. We then report magnitude and angle error as
given in equations (28) and (29). Finally, for planar normal range flow we can
only compute the planar magnitude error:



i^P3D 



E,-yp-||yp||2 



I|Ep||2 



X 100%, 



(31) 
as the direction of the computed and estimated correct plane flow are always the
same (the direction of the eigenvector corresponding to the largest eigenvalue).
We also examine $\psi_{abs}$, the average absolute error:

    $\psi_{abs} = \frac{1}{N}\sum_{i=1}^{N} |\,\vec{V}_c \cdot \hat{V}_p - \|\vec{V}_p\|_2\,|$.    (32)
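
The relative magnitude error and the angle error of equations (28) and (29) can be computed as in the sketch below. The function names are hypothetical, the cosine is clipped to guard against rounding, and the "estimated" vector is made up for illustration; only the correct translation is the one quoted for the synthetic sequence.

```python
import numpy as np


def magnitude_error_percent(v_c, v_e):
    """Relative magnitude error of an estimate v_e against the correct flow v_c, in percent."""
    norm_c = np.linalg.norm(v_c)
    return abs(norm_c - np.linalg.norm(v_e)) / norm_c * 100.0


def angle_error_degrees(v_c, v_e):
    """Angle between the correct and the estimated flow, in degrees."""
    cos = np.dot(v_c, v_e) / (np.linalg.norm(v_c) * np.linalg.norm(v_e))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))


v_correct = np.array([0.4, 0.6, 0.9])   # synthetic 3D translation, units/frame
v_estimated = 1.1 * v_correct           # made-up estimate: 10% too long, same direction

mag_err = magnitude_error_percent(v_correct, v_estimated)
ang_err = angle_error_degrees(v_correct, v_estimated)
```

For this made-up estimate the magnitude error is 10% and the angle error is essentially zero, since only the length of the vector differs.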

9 Synthetic Range Flow Results 

To test our algorithms, we made a synthetic range sequence where we know
the correct 3D translation (0.4, 0.6, 0.9 units/frame), allowing quantitative error
analysis. Figure 1a shows the synthetically generated depth map while
Figure 1b shows the corresponding image data. Each line in the depth map has
a Gaussian profile; this is made by simply rotating the coordinates into the line
and then applying the appropriate exponential function. The motion in Z is done
afterwards by simply adding the appropriate motion, thus W is globally constant
for this sequence. The image data was made by simply overlaying two sinusoids
with perpendicular orientations with the correct XY motion, thus $\vec{V} = (U, V, W)$
is globally constant.
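
One way to generate such a depth map is sketched below for a single line. Everything except the translation (0.4, 0.6, 0.9) is an assumption made for illustration: the image size, the line orientation and offset, the Gaussian width and height, and the function name are hypothetical, and the authors' actual generator may differ.

```python
import numpy as np


def depth_frame(t, shape=(32, 32), angle_deg=0.0, offset=10.0,
                sigma=2.0, height=5.0, velocity=(0.4, 0.6, 0.9)):
    """Depth map at frame t containing one line with a Gaussian cross-section."""
    u, v, w = velocity
    y, x = np.mgrid[0:shape[0], 0:shape[1]].astype(float)
    # Undo the XY translation so that the line appears shifted by (u*t, v*t).
    xs, ys = x - u * t, y - v * t
    # Rotate the coordinates into the line: d is the signed distance across it.
    theta = np.radians(angle_deg)
    d = -xs * np.sin(theta) + ys * np.cos(theta) - offset
    # Gaussian profile across the line; the Z motion is simply added afterwards.
    return height * np.exp(-d ** 2 / (2.0 * sigma ** 2)) + w * t


z0 = depth_frame(0)
z1 = depth_frame(1)
```

Because the Z motion is added to every pixel, the W component is the same everywhere, which matches the statement that W is globally constant.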




Fig. 1. (a) Synthetic depth map without texture and (b) a sinusoid texture.



Figure 2 shows the computed XY and XZ full, line and planar
range flow for this sequence while Table 1 gives their quantitative
magnitude (percentage) and direction (angle) error measures. We use the projected
correct flow in the direction of computed eigenvectors as “correct” line
and plane flow. These are good estimates of correct plane range flow but not so
good for line flow.

Figures 3a,b show the computed Horn and Schunck and Lucas and Kanade
flow fields for the two image sequences while Table 2 shows their
quantitative error. The local least squares image-range flow results are
Fig. 3. The image flow computed using (a) Horn and Schunck method (1000 iterations)
and (b) Lucas and Kanade’s method ($\tau_{D_1} = 1.0$) for the synthetic sequence.
Flows (c) and (d) show the XY and XZ components of range flow for the synthetic
sequence computed using the least squares image-range flow calculation.



also shown in Table 2 and Figures 3c,d. These range flow results are quite good,
better than Horn and Schunck image flow. This is quite remarkable, considering
that we are computing 100% dense 3D range flow (compared with 100% dense
2D image flow). Table 3 shows the magnitude and angle results for the Direct
and Combined regularization methods for 1000 iterations.
Results for the Combined regularization are the best (but not as good as the
Least Squares optical-range calculation). These results indicate that using both
image and range data is the best way to recover accurate 3D velocity fields. We
used $\alpha = \beta = 1.0$ for all regularizations.
Fig. 4. The computed XY and XZ components of Direct and Combined Regularized
Range Flow for 1000 iterations. 



10 Real Range Flow Results 

We also have one real range sequence which we made in 1997 at NRC in Ottawa¹.
Each image of this sequence is 454 x 1024 and was made by moving a scene
(consisting of some boxes wrapped in newspaper) a set of fixed equal distances
on a linear positioner and, after each movement, taking intensity and range images.
Thus, the correct 3D translation (0.095377, 1.424751, 0.087113) mm/frame
is known, allowing quantitative error analysis. NRC’s Biris range sensor was also
mounted on another linear positioner and at each time three sets of four overlapping
(intensity and X, Y and Z) images were acquired. These images were then
manually viewed and joined into one larger image (some partially overlaid data
was discarded). A sheet of white paper was also imaged and used to correct the

¹ Thanks to Luc Cournoyer at NRC for helping us make this data.



Fig. 5. (a) The smoothed subsampled intensity image for frame 25 of the NRC sequence 
and (b) its corresponding depth (Z) image. 

Table 1. Direction and magnitude error of the computed full, line and plane range 
velocities wrt the estimated “correct” full, line and range flow for the synthetic range 
sequence. 



Full Range Velocity ($\tau_{D_2} = 0.2$)
  $\psi_M$              $\psi_A$              Density
  0.045% ± 0.005%       0.007° ± 0.012°       33.68%

Line Normal Range Velocity ($\tau_{D_2} = 0.2$)
  $\psi_M$              $\psi_A$              Density
  11.24% ± 10.73%       21.89° ± 0.80°        41.21%

Plane Normal Range Velocity ($\tau_{D_2} = 0.2$)
  $\psi_{P3D}$          $\psi_{abs}$          Density
  4.78% ± 2.56%         0.75 ± 1.69           25.11%



intensity images by rescaling their intensities so that all intensities were white 
and then rescaling the acquired images by these same factors. 

In retrospect, if we were to make these images again, we would not use 
only planar surfaces, as only plane range flow can be recovered there. Sparse 
Fig. 6. The computed XY components of full and line range flow for the NRC sequence.



Table 2. Direction and magnitude error of the computed Horn and Schunck image
flow (for 1000 iterations), Lucas and Kanade image flow (for $\tau_{D_1} = 1.0$) and for the 3D
range flow computed via the least squares optical-range flow algorithm (for $\tau_{D_3} = 0.2$)
for the synthetic range sequence.

Method                                                    $\psi_M$              $\psi_A$              Density
Horn and Schunck XY Flow (1000 iterations)                0.27% ± 0.88%         0.07° ± 0.24°         100%
Lucas and Kanade XY Flow ($\tau_{D_1} = 1.0$)             0.0004% ± 0.0006%     0.0057° ± 0.0113°     81.86%
Least Squares Image-Range 3D Flow ($\tau_{D_3} = 0.2$)    0.048% ± 0.005%       0.007° ± 0.012°       100.0%



full and line flow can be recovered, but only at the boxes’ corners and edges. 
Nevertheless, we were able to compute some meaningful and dense full range 
flow fields using our regularization algorithms. To attenuate the effects of noise 
artifacts and to improve computational time we used level 1 of the Gaussian 
Fig. 7. The image flow computed using (a) Horn and Schunck method (1000 iterations)
and (b) Lucas and Kanade’s method ($\tau_{D_1} = 1.0$) on the NRC sequence. Flows (c) and
(d) show the XY components of range flow computed using the direct image-range flow
calculation and direct regularization.















Fig. 8. (a) Combined regularization XY flow for 1000 iterations and (b) direct
regularization (1000 iterations) initialized with combined regularization
(1000 iterations) for the NRC sequence.



Table 3. Direction and magnitude error of the computed flow via the Direct and 
Combined regularization algorithms for 1000 iterations for the synthetic sequence. 



Method                                        $\psi_M$            $\psi_A$
Direct Regularization (1000 iterations)       7.26% ± 8.96%       6.69° ± 8.34°
Combined Regularization (1000 iterations)     1.45% ± 3.44%       0.83° ± 1.96°



pyramid to compute all flows (3D Gaussian smoothing with a standard deviation
of 1.0 followed by subsampling in the X and Y dimensions by 2). Figure 5a shows
one intensity image in the sequence while Figure 5b shows its depth (Z) map (scaled
into an image). Since we used level 1 of the Gaussian pyramid our correct known
translation is halved. Because there are intensity patterns on the surfaces (the
printed newspaper text) and there is slight distortion in parameter estimation
caused by the local intensity variation, the Z values vary slightly according to the
surfaces’ intensity and one is able to read some text in the Z images (see Figure
5b). The top and bottom parts and some of the right side of the image in Figure
5 are part of the linear positioner setup; one cannot obtain good derivative
values here and to increase computational accuracy and speed we masked out
these parts of the image in our flow calculations. To show fully recovered range
flow fields we need to show both XY and XZ flow fields; however since the X and
Z flow components are only about 6% of the Y component for this sequence, the
XZ flows are quite small relative to the XY flows and due to space limitations
are not shown here. Figure 6 shows the computed XY full and line range flows.
The plane flows are quite small and not shown here. Table 4 gives the
quantitative results for these full, line normal and plane normal fields. Because
the plane normal flow is so small we just give its absolute error.

Figures 7a,b show the image flows recovered by Horn and Schunck’s algorithm
(1000 iterations) and Lucas and Kanade’s algorithm ($\tau_{D_1} = 1.0$)
while Figure 7c shows the XY flow using our least squares computation on
the intensity and range derivatives. Table 5 shows the magnitude and direction
errors for these flows.



Table 4. Direction and magnitude error of the computed full, line and plane range 
velocities wrt the estimated “correct” full, line and range flow for the NRC real range 
sequence. 



Full Range Flow ($\tau_{D_2} = 0.2$)
  $\psi_M$              $\psi_A$              Density
  21.76% ± 23.36%       22.66° ± 14.53°       1.04%

Line Normal Range Flow ($\tau_{D_2} = 0.2$)
  $\psi_M$              $\psi_A$              Density
  15.88% ± 21.83%       36.41° ± 22.42°       18.06%

Plane Normal Range Flow ($\tau_{D_2} = 0.2$)
  $\psi_{abs}$          Density
  0.20 ± 3.25           28.34%



Figures 7d and 8a show the regularized XY range flow fields for the direct
and combined algorithms for 1000 iterations while Table 6
shows their magnitude and angle errors. We used $\alpha = 10.0$ and $\beta = 1.0$ for
all the regularizations. For direct regularization, overall results are poor because
most of the image only has plane flow information; only the regions surrounding full
flow have good velocities. Results improve with more iterations. The combined
regularized flows are the best; these use both intensity and range derivative data
and yield dense flow. We report one last experiment: we use the flow after 1000
iterations of the combined regularization algorithm to initialize the direct
regularization algorithm (also 1000 iterations). The flow is shown in Figure 8b and
the error in Table 6. 71.75% of the flow had 10% or less magnitude error (average
magnitude error of 4.29% ± 2.51% and average angle error of 0.24° ± 1.64°). This
was the best result of all the NRC flows. This use of an initial set of non-zero
velocities in the initialization step of regularization seems to be one way to obtain
dense accurate flow for the NRC sequence.

Table 5. Direction and magnitude error of the computed Horn and Schunck image
flow (for 1000 iterations), Lucas and Kanade image flow (for $\tau_{D_1} = 1.0$) and for the 3D
range flow computed via the least squares optical-range flow algorithm (for $\tau_{D_3} = 0.2$)
for the NRC range sequence.

Method                                                    $\psi_M$              $\psi_A$            Density
Horn and Schunck XY Flow (1000 iterations)                10.33% ± 12.47%       3.48° ± 5.53°       82.67%
Lucas and Kanade XY Flow ($\tau_{D_1} = 1.0$)             10.51% ± 10.073%      9.68° ± 5.57°       8.11%
Least Squares Image-Range 3D Flow ($\tau_{D_3} = 0.2$)    13.80% ± 12.50%       14.04° ± 5.94°      65.25%



Table 6. Direction and magnitude error of the direct and combined regularized flow 
for 1000 iterations for the NRC sequence. Also shown are the error results when the 
combined regularized flow is used to initialize the direct regularization. The density of 
all flow fields (due to masking) is 82.68%. 



Method                                                       $\psi_M$              $\psi_A$
Direct Regularization (1000 iterations)                      39.97% ± 24.52%       7.84° ± 3.97°
Combined Regularization (1000 iterations)                    15.46% ± 20.06%       16.50° ± 13.08°
Direct Regularization (1000 iterations) initialized by
Combined Regularization (1000 iterations)                    9.76% ± 9.19%         5.88° ± 2.97°



11 Conclusions 

We have shown the computation of full, line normal and plane normal range 
flow on a synthetic intensity/range sequence. Our computation was in a least 
squares framework j2j; total least squares is used in crnia and we are currently
investigating the difference. Line normal flow was the most difficult flow
to compute accurately for this sequence. 

The NRC sequence is perhaps the most difficult type of range sequence to
analyze; most of the surfaces are planar with little or no full or line normal
velocity. The direct regularization algorithm was only able to compute full flow
in the vicinity of this full and line normal flow. The combined regularization used
both intensity and range data to obtain full flow everywhere. The usefulness of
combining the two types of data should not be in doubt; its flow was better than
that with the use of range data alone and, of course, image flow by itself cannot
be used to recover the 3rd component of range flow. When we initialized direct
regularization with combined regularized flow, we obtained the best results.
Abstract. The increasing amount of remotely sensed imagery from multiple
platforms requires efficient analysis techniques. The leading idea of
the presented work is to automate the interpretation of multisensor and 
multitemporal remote sensing images by the use of common prior knowl- 
edge about landscape scenes. In addition the system can use specific map 
knowledge of a GIS, information about sensor projections and temporal 
changes of scene objects. Prior expert knowledge about the scene con- 
tent is represented explicitly by a semantic net. A common concept has 
been developed to distinguish between the semantics of objects and their 
visual appearance in the different sensors considering the physical prin- 
ciple of the sensor and the material and surface properties of the objects. 

A flexible control system is used for the automated analysis, which em- 
ploys mixtures of bottom up and top down strategies for image analysis 
dependent on the respective state of interpretation. The control strategy 
employs rule based systems and is independent of the application. The 
system permits the fusion of several sensors like optical, infrared, and 
SAR-images, laser-scans etc. and it can be used for the fusion of images 
taken at different instances of time. Sensor fusion can be achieved on a 
pixel level, which requires prior rectification of the images, on feature 
level, which means that the same object may show up differently in dif- 
ferent sensors, and on object level, which means that different parts of an 
object can more accurately be recognized in different sensors. Results are 
shown for the extraction of roads from multisensor images. The approach 
for a multitemporal image analysis is illustrated for the recognition and 
extraction of an industrial fairground from an industrial area in an urban 
scene. 

1 Introduction 

The recognition of complex patterns and the understanding of complex scenes 
from remotely sensed data requires in many cases the use of multiple sensors 
and views taken at different time instances. For this purpose sensors such as 
optical, thermal, radar (SAR), and range sensors are used. In order to automate 
the processing of these sensor signals new concepts for sensor fusion are needed. 
In the following a novel approach to the automated multisensor analysis of aerial 
images is described, which results in a symbolic description of the observed scene 
content. The symbolic description is represented by a semantic net. 

R. Klette et al. (Eds.): Multi-Image Analysis, LNCS 2032, pp. 190-^^^ 2001. 
Due to the great variety of scenes to be interpreted a modern system for
image analysis should be adaptable to new applications. This flexibility can 
be achieved by a knowledge based approach where the application dependent 
knowledge is strictly separated from the control of information processing. In the 
literature various approaches to image interpretation have been presented. Most 
interpretation systems like SPAM and SIGMA ^ use a hierarchic control 
and construct the objects incrementally using multiple levels of detail. Inspired 
by ERNEST ^ the presented system AIDA formulates prior knowledge about 
the scene objects with semantic nets. In the following the system architecture is 
described and a common concept for the interpretation of images from multiple 
sensors is presented. 



2 Knowledge Based Interpretation 

For the automatic interpretation of remote sensing images the knowledge based 
system AIDA PI P] has been developed. The prior knowledge about the objects 
to be extracted is represented explicitly in a knowledge base. Additional domain 
specific knowledge like GIS data can be used to support the interpretation pro- 
cess. From the prior knowledge hypotheses about the appearance of the scene 
objects are generated which are verified in the sensor data. An image processing 
module extracts features that meet the constraints given by the expectations. 
It returns the found primitives, like polygons or line segments, to the interpretation
module which assigns a semantic meaning to them, e.g. river or road
or building. The system finally generates a symbolic description of the observed
scene. In the following the knowledge representation and the control scheme of
AIDA are described briefly.



2.1 Representation of Knowledge 

For the explicit representation of prior knowledge a semantic net has been chosen. 
Semantic nets are directed acyclic graphs and they consist of nodes and links 
in between. The nodes represent the objects expected to appear in the scene 
while the links of the semantic net form the relations between these objects. The 
properties of nodes and edges can be described by attributes. 

Two classes of nodes can be distinguished: the concepts are generic models 
of the object and the instances are realizations of their corresponding concepts 
in the observed scene. Thus, the prior knowledge is formulated consisting of con- 
cepts. During interpretation a symbolic scene description is generated consisting 
of instances. The object properties are described by attributes attached to the 
nodes. They contain an attribute value which is measured bottom-up in the data 
and an attribute range which represents the expected attribute value. The range 
is predefined and/or calculated during the interpretation. For each attribute a 
value and range computation function has to be defined. A judgement function 
computes the compatibility of the measured value with the expected range. 
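
The split between an expected attribute range (concept level) and a measured attribute value (instance level) can be illustrated with a small sketch. It is not AIDA's actual data model: the class name, the `tolerance` parameter, the linear fall-off outside the range and the road width numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Attribute:
    """Attribute of a node: expected range plus the value measured in the data."""
    name: str
    expected_range: Tuple[float, float]   # attribute range (the expectation)
    value: Optional[float] = None         # attribute value, measured bottom-up
    tolerance: float = 1.0                # hypothetical softness of the range borders

    def judge(self) -> float:
        """Compatibility in [0, 1] of the measured value with the expected range."""
        if self.value is None:
            return 0.0
        lo, hi = self.expected_range
        if lo <= self.value <= hi:
            return 1.0
        distance = lo - self.value if self.value < lo else self.value - hi
        return max(0.0, 1.0 - distance / self.tolerance)


# Concept level: a road is expected to be 4 to 12 m wide (made-up numbers).
road_width = Attribute("width_m", expected_range=(4.0, 12.0), tolerance=2.0)
# Instance level: the width measured for one extracted stripe.
road_width.value = 13.0
score = road_width.judge()
```

A measured width of 13 m lies 1 m outside the expected range, so with a tolerance of 2 m the judgement drops to 0.5 instead of rejecting the primitive outright.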

The relations between the objects are described by links forming the seman- 
tic net. The specialization of objects is described by the is-a relation introducing 
Fig. 1. Simplified semantic net representing the generic landscape model and its
relation to the sensor image.



the concept of inheritance. Along the is-a link all attributes, edges and functions
are inherited by the more special node, where they can be overwritten locally. Objects
are composed of parts represented by the part-of link. Thus the detection of an
object can be reduced to the detection of its parts. The transformation of an
abstract description into its more concrete representation in the data is modelled
by the concrete-of relation, abbreviated con-of. This relation allows the knowledge
to be structured in different conceptual layers, for example a scene layer, a
geometry and material layer and a sensor layer in Fig. 1. Topological relations
provide information about the kind and the properties of neighbouring objects.
For this purpose attributed relations (attr-rel) are introduced. In contrast to
other edges this relation has attributes which can be used to constrain the properties
of the connected nodes. For example a topological relation close-to can
restrict the position of an object to its immediate neighbourhood. The initial
concepts which can be extracted directly from the data are connected via the
data-of link to the primitives segmented by image processing algorithms.



2.2 Processing Control 

To make use of the knowledge represented in the semantic net control knowledge 
is required that states how and in which order scene analysis has to proceed. 
The control knowledge is represented explicitly by a set of rules. For example 
the rule for instantiation changes the state of an instance from hypothesis to 
complete instance, if all subnodes which are defined as obligatory in the concept 
net have been completely instantiated. If an obligatory subnode could not be
detected, the parent node becomes a missing instance. 
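
The instantiation rule just described can be written down as a small state update. The sketch below only illustrates that one rule; the class and function names are hypothetical, and the way leaf nodes are verified in the data is not modelled.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class State(Enum):
    HYPOTHESIS = "hypothesis"
    COMPLETE = "complete instance"
    MISSING = "missing instance"


@dataclass
class Node:
    name: str
    state: State = State.HYPOTHESIS
    obligatory_parts: List["Node"] = field(default_factory=list)


def apply_instantiation_rule(node: Node) -> State:
    """Update a node's state from the states of its obligatory subnodes."""
    if not node.obligatory_parts:
        return node.state            # leaves are verified in the data elsewhere
    states = [part.state for part in node.obligatory_parts]
    if any(s is State.MISSING for s in states):
        node.state = State.MISSING   # an obligatory subnode could not be detected
    elif all(s is State.COMPLETE for s in states):
        node.state = State.COMPLETE  # all obligatory subnodes are instantiated
    return node.state


tank = Node("sedimentation tank", State.COMPLETE)
building = Node("building")
plant = Node("purification plant", obligatory_parts=[tank, building])

first = apply_instantiation_rule(plant)    # building is still a hypothesis
building.state = State.COMPLETE
second = apply_instantiation_rule(plant)   # now every obligatory part is complete
```

The parent stays a hypothesis while any obligatory part is undecided, which is what lets the control defer the decision until more evidence has been extracted.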

An inference engine determines the sequence of rule executions. Whenever 
ambiguous interpretations occur they are treated as competing alternatives and 
are stored in the leaf nodes of a search tree. Each alternative is judged by com- 
paring the measured object properties with the expected ones. The judgement 
calculus models imprecision by fuzzy sets and considers uncertainties by distin- 
guishing the degrees of necessity and possibility. The judgements of attributes 
and nodes are fused to one numerical figure of merit for the whole interpretation 
state. The best judged alternative is selected for further investigation. Using 
a mixed top-down and bottom-up strategy the system generates model-driven 
hypotheses for scene objects and verifies them consecutively in the data. Ex- 
pectations about scene objects are translated into expected properties of image 
primitives to be extracted from the sensor data. Suitable image processing al- 
gorithms are activated and the semantic net assigns a semantic meaning to the 
returned primitives. 
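
Selecting the best judged alternative among the leaves of the search tree amounts to a best-first search. The sketch below shows the idea on a toy labelling problem. It is an illustration only: the primitive names, the scores and the use of the minimum as the fused figure of merit are assumptions, not the judgement calculus of the system.

```python
import heapq


def best_first(initial, expand, is_final, judge):
    """Repeatedly continue with the best judged leaf of the search tree."""
    counter = 0                                  # tie-breaker for equal judgements
    leaves = [(-judge(initial), counter, initial)]
    while leaves:
        _, _, state = heapq.heappop(leaves)      # best judged alternative
        if is_final(state):
            return state
        for alternative in expand(state):        # competing interpretations
            counter += 1
            heapq.heappush(leaves, (-judge(alternative), counter, alternative))
    return None


# Toy problem: assign a label to each of two extracted stripes (made-up scores).
PRIMITIVES = ["stripe_1", "stripe_2"]
SCORES = {"stripe_1": {"road": 0.9, "river": 0.4},
          "stripe_2": {"road": 0.3, "river": 0.8}}


def expand(state):
    nxt = PRIMITIVES[len(state)]
    return [state + (label,) for label in SCORES[nxt]]


def is_final(state):
    return len(state) == len(PRIMITIVES)


def judge(state):
    # Fuse the single judgements with the minimum (an assumed fusion operator).
    return min((SCORES[p][label] for p, label in zip(PRIMITIVES, state)), default=1.0)


best = best_first((), expand, is_final, judge)
```

With these scores the search returns the labelling `("road", "river")`, and the poorly judged alternatives stay in the queue unexpanded.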



2.3 Knowledge Base for the Interpretation of Aerial Images 

For object recognition only those features are relevant which can on the one
hand be observed by the sensor and on the other hand give a cue for the presence
of the object of interest. Hence the knowledge base contains only the necessary
and visible object classes and properties. The network language described above
is used to represent the prior knowledge by a semantic net. In Fig. 1 part of
a generic model for the interpretation of remote sensing data in a landscape 
scene containing a purification plant is shown. It is divided into the 3D scene 
layer and the 2D image layer. The 3D scene layer is split into a semantic layer 
and a physical layer, here a geometry- and a material layer. If a geo-information 
system (GIS) is available and applicable an additional GIS layer can be defined 
representing the scene specific knowledge from the GIS. The 2D image layer 
contains the sensor layers adapted to the current sensors and the data layer. 

For the objects of the 2D image domain general knowledge about the sensors 
and methods for the extraction and grouping of image primitives like lines and 
regions is needed. The primitives are extracted by image processing algorithms 
and they are stored in the semantic net as instances of concepts like 2D-Stripe 
or Rectangle. Due to the variety of possible regions they have to be grouped 
according to perceptual criteria like compactness, size, shape etc. The sensor 
layer can be adapted to the current sensor type like optical camera, SAR, range 
sensor, etc. For a multisensor analysis the layer is duplicated for each new sensor 
type to be interpreted assuming that each object can be observed in all the 
images (see Fig. 1). All information of the 2D image domain is given relative
to the image coordinate system. As each transformation between image and
scene domain is determined by the sensor type and its projection parameters 
the transformations are modelled explicitly in the semantic net by the concept 
Sensor and its specializations for the different sensor types. 

The knowledge about inherent and sensor independent properties of objects
is represented in the 3D scene domain which is subdivided into the physical,
the GIS and the semantic layer. The physical layer contains the geometric and
radiometric material properties as basis for the sensor specific projection. Hence
it forms the interface to the sensor layer(s). The semantic layer represents the
most abstract layer where the scene objects with their symbolic meanings are
stored. The semantic net eases the formulation of hierarchical and topological
relations between objects. Thus it is possible to describe complex objects like a
purification plant as a composition of sedimentation tanks and buildings, which
are close to a river and are connected by roads to the road net, or an industrial
site as a composition of halls close to each other and parking lots. The symbolic
objects are specified more concretely by their geometry. In conjunction with the
known sensor type the geometrical and radiometrical appearance of the objects
in the image can be predicted. This prediction can be improved if GIS data of the
observation area is available. Though the GIS may be out of date it represents
a partial interpretation of the scene providing semantic information. Hence the
GIS objects are connected directly with the objects of the semantic layer (Fig. 1).

3 Interpretation of Multisensor Images 

The automatic analysis of multisensor data requires the fusion of the data. The 
presented concept to separate strictly the sensor independent knowledge of the 
3D scene domain from the sensor dependent knowledge in the 2D image domain 
eases the integration and simultaneous interpretation of images from multiple 
sensors. New sensor types can be introduced by simply defining another special- 
ization of the Sensor node with the corresponding geometrical and radiometrical 
transformations. According to the images to be interpreted the different sensor 
layers (SAR, IR, Optical, Range) are activated. 




Fig. 2. Rejected (thin line) and accepted (wide line) road features from a) visual and b) 
infrared image with c) fusion result. Each object (road, river, building, sedimentation 
tank) is approximated by a polygon mesh to model geometry. 
The interpretation distinguishes the following types of sensor fusion:

Sensor selection: The object can be extracted completely using only one sensor.
For example, rivers show up clearly in infrared images (Fig. 2b) due to
their cold temperatures.

Composite feature: This fusion type exploits several con-of links to combine
redundant sensors. The extraction of the feature from only one sensor is
erroneous, like the road extraction from the visual sensor or infrared sensor
alone. Hence the extraction combines the measured feature properties to
improve the road detection (see Fig. 2).

Composite object: The object is composed of several complementary parts,
indicated by part-of links, which can be extracted from different sensors. The
purification plant in Fig. 1 consists of sedimentation tanks and buildings
(Fig. 2). The complex task of detecting a purification plant is simplified to
the extraction of the buildings from the visual and the sedimentation tank
from the infrared image. Furthermore, the plant has a road access and is
located close-to a river to drain off cleaned water.

Composite context: The object may be only detectable in a certain context.
For example, the roads in urban areas are usually accompanied by building
rows along their sides which show up as bright lines in a SAR image. In
Fig. 3 only those segmented dark stripes in the aerial image are interpreted
as roads which are supported by parallel bright lines in the SAR image.




Fig. 3. The segmented lines, i.e. road candidates in the visual image (a) must be 
accompanied by parallel lines as hint for buildings in the SAR image (b) to verify a 
road hypothesis (c). 
Fig. 4. Images from an airborne optical sensor (a) and range sensor (b) serving as
input. (c) Segmentation results for the extraction of halls representing an intermediate
step during the scene interpretation.



If a GIS is available the object location can be constrained further. However, 
the GIS may be out of date and incomplete. Hence the GIS is used to hypothesize 
an initial scene description to be tested in the remote sensing data. The use of 
a GIS is described in ^j. 

For the application of an industrial town scene, here the industrial fairground
in Hanover/Germany, the advantages of a multisensor image analysis based on
composite feature fusion are illustrated in Fig. 4. In this case an aerial photo
(Fig. 4a) and a range image (Fig. 4b) are used. Taking only the aerial image the
flat building roofs cannot be differentiated from the streets and the parking lots.
Using only the range image the walkways for pedestrians cannot be differentiated
from the parking lots containing cars. If both images are analyzed simultaneously
the buildings can be clearly separated from the street level by their elevation as
is shown in Fig. 4c and the walkways can be differentiated from the parking lots
by the recognition of regular pattern of rows of cars. Other examples for the 
fusion of multisensor images are given in jO] and 0 . 



4 Interpretation of Multitemporal Images 



Currently the system is being extended for the interpretation of multitemporal
images. Applications like change detection and monitoring require the analysis of 
images from different acquisition times. By comparing the current image with the 
latest interpretation derived from the preceding image changes in land use and 
new construction sites can be detected. In the following the necessary extensions 
to a multitemporal analysis with the system AIDA are described and illustrated 
on the application of recognizing an industrial fairground. 
4.1 Extension of the Knowledge Based System

The easiest way to generate a prediction for the current image from an existing 
scene interpretation is to assume that nothing has changed during the elapsed 
time. But in many cases humans have knowledge about possible or at least 
probable temporal changes. Hence the knowledge about possible state transitions 
between two time steps should be exploited in order to increase the reliability 
of the scene interpretation. 

Temporal changes can be formulated in a so-called state transition graph,
where the nodes represent the temporal states and the edges model the state
transitions. To integrate the graph in a semantic net, the states are represented
by concept nodes which are connected by a new relation: the temporal relation.
For each temporal relation a transition probability can be defined. As states can
either be stable or transient, the corresponding state transitions differ in their
transition time, which can also be specified for the temporal relation. For the
exploitation of the temporal knowledge a time stamp is attached to each node
of the semantic net, which documents the time of its instantiation. As normally
no knowledge about the temporal changes of geometrical objects is available the 
state transition diagram is part of the scene layer (compare Fig. E|l. In contrast 
to hierarchical relations like part-of or con-of the start and end node of temporal 
relations may be identical - forming a loop - to represent that the state stays 
unchanged over the time. 

During the interpretation process the state transition diagram is used by a 
new inference rule. Analysis starts with the first image of the given sequence,
marked with time stamp $t_1$. If a state of the state transition diagram can be
instantiated completely, the temporal knowledge is used to hypothesize one or
more possible successors of this state for the next image in chronological order
(time stamp $t_2$). The system selects all successor states that can be reached
within the elapsed time $t_2 - t_1$ according to the transition times defined in
the temporal relations. Duplicate selections of a state caused by loops in the
transition diagram are eliminated. The possible successor states are sorted by
decreasing probability so that the most probable state is investigated first. All
hypotheses are treated as competing alternatives represented in separate leaf 
nodes of the search tree (see Chap. I'Z.'Zt . Starting with the alternative of the 
highest probability the hypotheses for the successor state are either verified or 
falsified in the current image. For continuous monitoring the time stamps of the
instances can be used to remove the old nodes of $t_1$.



4.2 Recognition of an Industrial Fairground 

An industrial fairground is an example for a complex structure detectable by 
a multitemporal image interpretation only. Using a single image it would be 
classified as an industrial area consisting of a number of halls. But during several 
weeks of the year some unusual activity can be observed: exhibition booths are 
constructed, visitors pour to the site, and the booths are dismantled again. This 
knowledge about observable events can be exploited for the automatic extraction 
of a fairground and formulated in a semantic net (see Fig. 5). The different states
of a fairground are represented by the concepts FairInactivity, FairConstruction,
FairActive, and FairDismantling. The states representing the actual fair, i.e.
the construction state, the active state, and the dismantling state, are transient
compared to the FairInactivity state, which is valid most of the year. Therefore
transition times of four to eight days are defined for the corresponding temporal
relations, and the node FairInactivity is looped back to itself.

Fig. 5. Semantic net for the detection of an industrial fairground with integrated state
transition graph.



The analysis starts with the first image looking for an Industrial Area. In the
given example the system searches for at least three halls and two parking lots.
If the Industrial Area can be instantiated completely, the system tries to refine
the interpretation by replacing the Industrial Area with a more special concept.
There are four possible specializations (FairInactivity to FairDismantling) and
the search tree splits into four leaf nodes. Each hypothesis is tested in the image
data. A construction or dismantling phase is characterized by trucks near the
halls which hold the equipment for the booths. Hence the system searches for
bright rectangles close to the halls. An active fair can be recognized by parking
lots filled with cars and, if the image accuracy is sufficient, by persons walking
on the fairground.

If one of the four states can be verified, the temporal inference is activated.
The system switches to the next image in the sequence and generates hypotheses
for the successor state. According to the elapsed time and considering the
transition times, all possible successors are determined. If, for example, the time
step between the two images was two weeks, it is possible that FairInactivity
follows immediately after FairActive, omitting the dismantling phase. All
hypothesized successor states are represented in separate leaf nodes and are
treated as competing alternatives. Having found hints for all obligatory states,
a complete instance of Industrial Fairground can be generated and the
interpretation goal is reached.
The presented approach is currently being tested for a sequence of five aerial im- 
ages of the Hanover fairground. First results are documented in [^. 
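
The temporal inference described above can be illustrated with a minimal Python sketch. It is not the AIDA implementation. The four state names come from the text; the class `TemporalRelation`, the function `hypothesize_successors`, all probabilities and transition times, and the rule of multiplying probabilities along a chain of transitions are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalRelation:
    target: str         # successor state (concept node)
    probability: float  # transition probability (hypothetical value)
    min_days: float     # transition time needed before the successor is reached


# Hypothetical state transition graph; only the state names are from the text.
GRAPH = {
    "FairInactivity": [TemporalRelation("FairInactivity", 0.9, 0.0),
                       TemporalRelation("FairConstruction", 0.1, 0.0)],
    "FairConstruction": [TemporalRelation("FairConstruction", 0.5, 0.0),
                         TemporalRelation("FairActive", 0.5, 4.0)],
    "FairActive": [TemporalRelation("FairActive", 0.5, 0.0),
                   TemporalRelation("FairDismantling", 0.5, 4.0)],
    "FairDismantling": [TemporalRelation("FairDismantling", 0.5, 0.0),
                        TemporalRelation("FairInactivity", 0.5, 4.0)],
}


def hypothesize_successors(graph, state, elapsed_days):
    """Return (state, probability) pairs reachable within elapsed_days,
    without duplicates, sorted by decreasing probability."""
    best = {}

    def walk(current, prob, days, on_path):
        for rel in graph.get(current, []):
            total = days + rel.min_days
            if total > elapsed_days:
                continue  # successor cannot be reached in the elapsed time
            p = prob * rel.probability
            if p > best.get(rel.target, 0.0):
                best[rel.target] = p  # keep each state once (loops eliminated)
            if rel.target not in on_path:
                walk(rel.target, p, total, on_path | {rel.target})

    walk(state, 1.0, 0.0, frozenset([state]))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Two weeks after an active fair, FairInactivity is among the hypotheses.
hypotheses = hypothesize_successors(GRAPH, "FairActive", elapsed_days=14)
```

Each returned hypothesis would correspond to one leaf node of the search tree and would be verified or falsified in the current image, starting with the most probable one.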

5 Conclusions 

A knowledge based scene interpretation system called AIDA was presented,
which uses semantic nets, rules, and computation methods to represent the
knowledge needed for the interpretation of remote sensing images. Controlled
by an adaptable interpretation strategy, the knowledge base is exploited to derive
a symbolic description of the observed scene in the form of an instantiated
semantic net. If available, the information of a GIS database is used as a partial
interpretation, increasing the reliability of the generated hypotheses. The system
is employed for the automatic recognition of complex structures from multisensor
images. Different paradigms for multisensor and multitemporal fusion
can be used, enabling the recognition of complex structures like street nets or, in
this example, a purification plant. The use of knowledge about temporal changes
improves the generation of hypotheses for succeeding time instances and allows,
for example, the extraction of complex structures like an industrial fairground.
In another application a detailed interpretation of moorland areas was
accomplished.

The knowledge based scene interpretation system AIDA is a promising
approach in the field of image understanding, because it provides a common
concept for the use of multisensor and multitemporal information in connection with
machine accessible prior knowledge about the scene content.


Abstract. We present an approach for learning appearance-based
recognition functions, whose novelty is the sparseness of necessary training
views, the exploitation of constraints between the views, and a special
treatment of discriminative views. These characteristics reflect the
trade-off between efficiency, invariance, and discriminability of recognition
functions. The technological foundation for making adequate compromises
is a combined use of principal component analysis (PCA) and
Gaussian basis function networks (GBFN). In contrast to usual applications
we utilize PCA for an ellipsoidal interpolation (instead of approximation)
of a small set of seed views. The ellipsoid enforces several biases
which are useful for regularizing the process of learning. In order to control
the discriminability between target and counter objects, the coarse
manifold must be fine-tuned locally. This is obtained by dynamically
installing weighted Gaussian basis functions for discriminative views.
Using this approach, recognition functions can be learned for objects
under varying viewing angle and/or distance. Experiments in numerous
real-world applications showed impressive recognition rates.



1 Introduction 

Famous physiologists (e.g. Hermann von Helmholtz) insisted on the central role 
of learning in visual processes |2|. For example, object recognition is based on 
adequate a priori information which can be acquired by learning in the actual 
environment. The statistical method of principal component analysis (PCA) has 
been used frequently for this purpose, e.g. by Turk and Pentland for recognition 
of faces IBI, or by Murase and Nayar for recognition of arbitrary objects 0. The 
most serious problem in using PCA for recognition is the daring assumption of 
one multi-dimensional Gaussian distribution of the vector population, which is 
not true in many realistic applications. Consequently, approaches of nonlinear 
dimension reduction have been developed, in which the input data are clustered 
and local PCA is applied for each cluster, respectively. The resulting architec- 
ture is a Gaussian basis function network which approximates the manifold more 
accurately by a combination of local multi-dimensional Gaussian distributions 
1^. However, large numbers of training views are required for approximating 
the Gaussians. Furthermore, the description length has increased which makes 

R. Klette et al. (Eds.): Multi-Image Analysis, LNCS 2032, pp. 201-^^^ 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001

the recognition function less efficient in application. Our concern is to reduce
the effort of training and description by discovering and incorporating invariances
among the set of object views. Apart from characteristics of efficiency
and invariance, the major criterion for evaluating a recognition function is the 
discriminability, i.e. the capability to discriminate between the target object and 
counter objects. Similar views stemming from different objects are of special in- 
terest for learning reliable recognition functions. This principle is fundamental 
for the methodology of support vector machines. At the border between neigh- 
boring classes a small set of critical elements must be determined from which 
to construct the decision boundary P|. Although the border elements play a 
significant role it would be advantageous to additionally incorporate a statisti- 
cal approximation of the distribution of training samples. Our approach takes 
special care for counter (critical) views but also approximates the distribution 
of all training views. 

2 Foundation for Object Recognition 

For the purpose of object recognition we construct an implicit function $f^{im}$,
which approximates the manifold of appearance patterns under different viewing
conditions.

    $f^{im}(A, Z) = 0$    (1)

Parameter vector $A$ specifies a certain version subject to the type of the function,
and measurement vector $Z$ is the representation of an appearance pattern. In
terms of the Lie group theory of invariance ^, the manifold of realizations of
$Z$ is the orbit of a generator function whose invariant features are represented
in $A$. Function $f^{im}$ must be learned such that equation (1) holds more or less
for patterns of the target object and clearly does not hold for patterns of counter
situations. Only small deviations from the ideal orbit are accepted for target
patterns, and large deviations are expected for counter patterns. The degree of
deviation is controlled by a parameter $\psi$.

    $|f^{im}(A, Z)| < \psi$    (2)



The function $f^{im}$ can be squared and transformed by an exponential function
in order to obtain a value in the unit interval.

    $f^{Gi}(A, Z) := \exp(-f^{im}(A, Z)^2)$    (3)

If function $f^{Gi}$ yields value 0, then vector $Z$ is infinitely far away from the
orbit; if it yields value 1, then vector $Z$ belongs to the orbit. Equation (1)
can be replaced equivalently by

    $f^{Gi}(A, Z) = 1$    (4)



^ Our work in |S| replaces the concept of invariance realistically by the concept of 
compatibility.

For reasons of consistency, we also use the exponential function to transform
parameter $\psi$ into $\zeta$ in order to obtain a threshold for proximities, i.e.
$\zeta := \exp(-\psi^2)$. With these transformations, we replace equation (2)
equivalently by

    $f^{Gi}(A, Z) > \zeta$    (5)
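
The equivalence between the deviation test and the proximity test can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the implicit function value is passed in as a plain number instead of being computed from $A$ and $Z$.

```python
import math


def proximity(f_im_value):
    """Map an implicit function value to the unit interval (exponential transform)."""
    return math.exp(-f_im_value ** 2)


def threshold(psi):
    """Transform the deviation bound psi into the proximity threshold zeta."""
    return math.exp(-psi ** 2)


def accept_by_deviation(f_im_value, psi):
    """Acceptance test on the raw deviation from the orbit."""
    return abs(f_im_value) < psi


def accept_by_proximity(f_im_value, psi):
    """Acceptance test on the transformed proximity value."""
    return proximity(f_im_value) > threshold(psi)


# Both tests give the same decision because exp(-x^2) decreases with |x|.
decisions = [(v, accept_by_deviation(v, 0.5), accept_by_proximity(v, 0.5))
             for v in (-2.0, -0.3, 0.0, 0.1, 0.49, 0.51, 3.0)]
```

Working with proximities in the unit interval is convenient later on, when weighted Gaussians with values in the same range are added to the transformed function.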

3 Concept of Canonical Frames and Ellipsoid Basis
Function

Learning a recognition function requires the estimation of parameter vector A 
based on measurement vectors Z . However, for an appearance-based approach 
to recognition the input space is high-dimensional as it consists of patterns, 
and frequently, also the parameter vector is high-dimensional. Consequently, 
first we project the high-dimensional space in a low-dimensional subspace (so- 
called canonical frame), and then we do the learning therein. The construction 
of canonical frames is based on so-called seed images which are representative 
for the object. The learning procedure is based on a coarse-to-fine strategy in 
which the coarse part does the subspace construction and is responsible for global 
aspects in the manifold of patterns. The subsequent refinement step treats local 
aspects in the manifold by taking more specific object views or counter situations 
into account, i.e. so-called validation views. 

We impose three requirements on canonical frames. First, the implicit function
will be defined in the canonical frame and should have a simpler description
than in the original frame. Second, equation (1) must hold perfectly for all seed
images, each of which is represented as a vector $Z$. Therefore, the parameters
$A$ are invariant features of the set of all seed images. Third, the implicit
function should consider generalization biases as treated in the theory of Machine 
Learning nm pp. 349-363]. For example, according to the enlarge-set bias and 
the close-interval bias, the implicit function must respond continuously around the
seed vectors and must respond nearly invariantly along certain courses between
successive seed images (in the space of images). The minimal-risk bias avoids 
hazardous decisions by preferring low degrees of generalization. 

An appropriate canonical frame together with a definition of the implicit
function can be constructed by principal component analysis (PCA). Remarkably,
we use PCA for interpolating a small set of seed images by a hyper-ellipsoid
function. As all seed images are equally significant, we avoid approximations in
order not to waste essential information. Based on the covariance matrix of the
seed images, we take the normalized eigenvectors as basis vectors. The
representation of a seed image in the canonical frame is obtained by Karhunen-Loeve
expansion (KLE). Implicit function $f^{im}$ is defined as a hyper-ellipsoid in normal
form, with the half-lengths of the ellipsoid axes defined dependent on the
eigenvalues of the covariance matrix. As a result, the seed vectors are located on the
orbit of this hyper-ellipsoid, and invariants are based on the half-lengths of the
ellipsoid axes.

4 Construction of Canonical Frame and Ellipsoid Basis
Function

Let $\Omega := \{X_i^s \mid i = 1, \dots, I\}$ be the vectors representing the seed
images of an object. Based on the mean vector $X^c$ the covariance matrix $C$ is
obtained by

    $C := \frac{1}{I} \cdot M \cdot M^T, \quad M := (X_1^s - X^c, \dots, X_I^s - X^c)$    (6)

We compute the eigenvectors $e_1, \dots, e_I$ and eigenvalues $\lambda_1, \dots, \lambda_I$
(in decreasing order). The lowest eigenvalue is equal to 0 and therefore the number
of relevant eigenvectors is at most $(I - 1)$ (this statement can be proved easily).
The $I$ original vectors of $\Omega$ can be represented in a coordinate system which
is defined by just $(I - 1)$ eigenvectors $e_1, \dots, e_{I-1}$ and the origin of the
system. KLE defines the projection/representation of a vector $X$ in the
$(I - 1)$-dimensional eigenspace.

    $\hat{X} := (\hat{x}_1, \dots, \hat{x}_{I-1})^T := (e_1, \dots, e_{I-1})^T \cdot (X - X^c)$    (7)

Based on PCA and KLE we introduce the $(I - 1)$-dimensional hyper-ellipsoid
function.

    $f^{im}(A, Z) := \left( \sum_{l=1}^{I-1} \frac{\hat{x}_l^2}{\kappa_l^2} \right) - 1$    (8)

Measurement vector $Z := \hat{X}$ is defined according to equation (7). Parameter
vector $A := (\kappa_1, \dots, \kappa_{I-1})^T$ contains parameters $\kappa_l$, which are
taken as half-lengths of the ellipsoid axes in normal form and are defined as

    $\kappa_l := \sqrt{(I - 1) \cdot \lambda_l}$    (9)

For the special case of assigning the KLE-represented seed vectors to $Z$, we
can prove that equation (1) holds perfectly for all seed vectors ^. All
seed vectors are located on the orbit of the hyper-ellipsoid defined above, and
therefore, the half-lengths are an invariant description for the set of seed vectors.
The question of interest is twofold, (i) why use ellipsoid interpolation, and (ii) 
why use PCA for constructing the ellipsoid? 

Ad (i): The hyper-ellipsoid considers the enlarge-set and the close-interval
biases, as demonstrated in the following. Let us assume three points $X_1^s, X_2^s, X_3^s$
in 2D, visualized as black disks in the left diagram of Figure 1. The 2D ellipse
through the points is constructed by PCA. The right diagram of Figure 1 shows
a constant value 1 when applying function $f^{Gi}$ (as defined in equations (3) and
(8)) to all orbit points of the ellipse. Therefore the generalization comprises all
points on the ellipse (close-interval bias). The degree of generalization can be
increased further by considering the threshold $\zeta$ and accepting for $f^{Gi}$ small
deviations from 1. The relevant manifold of points is enlarged, as shown by the
dotted band around the ellipse in Figure 1 (enlarge-set bias).

^ The proof that all seed vectors lie on the hyper-ellipsoid is given in the Appendix.

Fig. 1. (Left) Input space with three particular points from which a 2D ellipse is
defined by PCA; small deviations from this ellipse are constrained by an inner and
an outer ellipse. (Right) Result of function $f^{Gi}$ along the ellipse, which is
constant 1; accepted deviations are indicated by offset horizontal lines.



Ad (ii): In general, more than $I$ points are necessary for fitting a unique
$(I - 1)$-dimensional hyper-ellipsoid. PCA determines the first principal axis by
maximizing the variance which is obtained by an orthogonal projection of the
sample points on hypothetical axes. Actually, this is the constraint
which makes the fitting unique. Figure 2 shows two examples of ellipses fitting
the same set of three points; the left one was determined by PCA, and the right
one was fitted manually. As expected, the variance on the right is lower than on
the left, measured along the respective dashed axes. As an example, the figure
also shows that the maximum variance (on the left) implies a
minimal size of the ellipsoid. The size of the ellipsoid manifold correlates with the
degree of generalization, and therefore, PCA produces moderate generalizations
by avoiding large ellipsoids (minimal-risk bias).
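
The construction of this section can be checked numerically with a short NumPy sketch. It assumes that the covariance matrix carries the factor $1/I$, that the seed vectors are in general position (so that $I - 1$ eigenvalues are non-zero), and that the vector dimension is at least $I - 1$. The function names `build_ellipsoid` and `f_im` are hypothetical and not taken from the authors' software.

```python
import numpy as np


def build_ellipsoid(seeds):
    """seeds: array of shape (I, m), one seed vector per row.
    Returns the mean vector, the I-1 leading eigenvectors, and the half-lengths."""
    n_seeds = seeds.shape[0]
    center = seeds.mean(axis=0)
    m_matrix = (seeds - center).T                  # columns are centered seed vectors
    cov = m_matrix @ m_matrix.T / n_seeds          # covariance with factor 1/I
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][: n_seeds - 1]
    basis = eigvecs[:, order]                      # I-1 relevant eigenvectors
    kappa = np.sqrt((n_seeds - 1) * eigvals[order])  # half-lengths of the axes
    return center, basis, kappa


def f_im(x, center, basis, kappa):
    """Implicit hyper-ellipsoid function in the canonical frame."""
    x_hat = basis.T @ (x - center)                 # Karhunen-Loeve expansion
    return float(np.sum((x_hat / kappa) ** 2) - 1.0)


# Three points in 2D, as in the example of Figure 1 (coordinates are made up).
points = np.array([[0.0, 0.0], [4.0, 1.0], [1.0, 3.0]])
center, basis, kappa = build_ellipsoid(points)
residuals = [f_im(p, center, basis, kappa) for p in points]  # all close to 0
```

The residuals vanish up to rounding for every seed vector, which is the interpolation property proved in the Appendix; for vectors off the ellipsoid the value of `f_im` measures the deviation from the orbit.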

5 Fine Approximation Based on Validation Views 

The manifold defined by the hyper-ellipsoid must be refined in order to consider
the discriminability criterion of recognition functions. We take an ensemble
$\mathcal{X}^{v}$ of validation views into account (different from the ensemble of
seed views), which in turn is subdivided into two classes. The first class
$\mathcal{X}^{vp}$ (positive) of validation views is taken additionally from the target
object, and the second class $\mathcal{X}^{vn}$ (negative) is taken from counter objects
or situations. Depending on certain results of applying the implicit function to
these validation views we specify spherical Gaussians and combine them
appropriately with the implicit function. The purpose is to obtain a modified orbit
which includes target views and excludes counter views.

For each validation view $X_j^v \in \mathcal{X}^{vp} \cup \mathcal{X}^{vn}$ the function
$f^{Gi}$ yields a measurement of proximity $\eta_j$ to the hyper-ellipsoid orbit.

    $\eta_j := f^{Gi}(A, X_j^v)$    (10)

Fig. 2. Ellipses fitted through three points; (Left) Ellipse determined by PCA, showing
first principal axis, determined by maximizing the variance; (Right) Ellipse determined
manually with less variance along the dashed axis.



For $\eta_j = 0$ the view is far away from the orbit, for $\eta_j = 1$ the view belongs
to the orbit. There are two cases for which it is reasonable to modify the implicit
function. First, a view of the target object may be too far away from the orbit,
i.e. $X_j^v \in \mathcal{X}^{vp}$ and $\eta_j < \zeta$. Second, a view of a counter situation
may be too close to the orbit, i.e. $X_j^v \in \mathcal{X}^{vn}$ and $\eta_j > \zeta$. In
the first case the modified function should yield a value near 1 for validation
view $X_j^v$, and in the second case it should yield a value near 0. Additionally,
we would like to reach generalization effects in the local neighborhood (in the
space of views) of the validation view. The modification of the implicit function
takes place by locally putting a spherical Gaussian $f_j^{Gs}$ into the space of views,
then multiplying the Gaussian by a weighting factor $w_j$, and finally adding the
weighted Gaussian to the implicit function. The mentioned requirements are reached
with the following parameterizations. The center vector of the Gaussian is defined
as $X_j^v$.

    $f_j^{Gs}(X) := \exp\left(-\frac{1}{\tau} \cdot \|X - X_j^v\|^2\right)$    (11)

For the two cases we define the weighting factor $w_j$ dependent on $\eta_j$.

    $w_j := \begin{cases} 1 - \eta_j & \text{: first case (target pattern too far away from orbit)} \\ -\eta_j & \text{: second case (counter pattern too close to orbit)} \end{cases}$    (12)

The additive combination of implicit function and weighted Gaussian yields
a new function for which the orbit has changed, and which in particular meets the
requirements for the validation view $X = X_j^v$. In both cases, the Gaussian value
is 1, and the specific weight plays the role of an increment or decrement, respectively,
to obtain the final outcome 1 for the case $X_j^v \in \mathcal{X}^{vp}$ and 0 for the case
$X_j^v \in \mathcal{X}^{vn}$. The coarse-to-fine strategy of learning can be illustrated graphically
(by recalling and modifying Figure 1). The course of proximity values obtained
along the ellipse is constant 1 (see Figure 3, left and right), and along a straight
line crossing the ellipse perpendicularly, the course of proximity values is a Gaussian
(see Figure 3, left and middle).




Fig. 3. (Left) Ellipse through three seed vectors and perpendicular straight line across 
the ellipse; (Middle) Gaussian course of proximity values along the straight line; (Right) 
Constant course of proximity values along the ellipse. 



The first example considers a counter vector, i.e. $X_2^v \in \mathcal{X}^{vn}$, which is too
near to the ellipse. A Gaussian $f_2^{Gs}$ is defined with $X_2^v$ as center vector, and
weight $w_2$ defined by $-\eta_2$. Based on the additive combination of implicit function
and weighted Gaussian the value decreases locally around point $X_2^v$. Figure 4
(left and middle) shows the effect along the straight line passing through the
ellipse, i.e. the summation of the two dashed Gaussians results in the bold curve.
Figure 4 (left and right) shows the effect along the ellipse, i.e. the constancy is
disturbed locally, which is due to diffusion effects originating from the added
Gaussian.

The second example considers an additional view from the target object,
i.e. $X_3^v \in \mathcal{X}^{vp}$, which is far off the ellipse orbit. The application of $f^{Gi}$ at
$X_3^v$ yields $\eta_3$. A Gaussian $f_3^{Gs}$ is defined with vector $X_3^v$ taken as center
vector, and the weighting factor $w_3$ is defined by $(1 - \eta_3)$. The combination of
implicit function and weighted Gaussian is constant 1 along the course of the
ellipse (for this example), and additionally the values around $X_3^v$ are increased
according to a Gaussian shape (see Figure 5).



6 Construction of Recognition Functions 



The recognition function $f^{rc}$ is defined as the sum of the implicit function and
a linear combination of Gaussians.






J. Pauli and G. Sommer 




Fig. 4. (Left) Ellipse through three seed vectors and perpendicular straight line 
through a counter vector located near to the ellipse; (Middle) Along the straight line the 
positive Gaussian course of proximity values is added with the shifted negative Gaus- 
sian originating from the counter vector, such that the result varies slightly around 0; 
(Right) Along the ellipse the values locally decrease at the position near the counter 
vector. 




Fig. 5. (Left) Ellipse through three seed vectors and perpendicular straight line 
through a further target vector located far off the ellipse; (Middle) Along the straight 
line the positive Gaussian course of proximity values is added with the shifted positive 
Gaussian originating from the target vector, such that the result describes two shifted 
Gaussians; (Right) Along the ellipse the values are constant 1. 



    $f^{rc}(X) := f^{Gi}(A, X) + \sum_{j=1}^{J} w_j \cdot f_j^{Gs}(X)$    (13)

Vector $X$ represents an unknown view which has to be recognized. Parameter
vector $A$ is determined during the coarse approximation phase, and the set
of Gaussians is constructed during the fine approximation phase. Factor $\tau$ for
specifying the extent of the Gaussians is obtained by the Levenberg-Marquardt
algorithm [3, pp. 683-688].

This coarse-to-fine strategy of learning can be applied to any target object
which we would like to recognize. If $k \in \{1, \dots, K\}$ is the index for a set of target
objects, then recognition functions $f_k^{rc}$, with $k \in \{1, \dots, K\}$, can be learned
as above. The final decision for classifying an unknown view $X$ is made by looking
for the maximum value computed from the set of recognition functions $f_k^{rc}$,
$k \in \{1, \dots, K\}$.

    $k^* := \arg\max_{k \in \{1, \dots, K\}} f_k^{rc}(X)$    (14)
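
The fine approximation for a single target object can be sketched as follows. This is a minimal illustration under several assumptions: a unit circle serves as a stand-in for the coarse hyper-ellipsoid, the extent `tau` is fixed by hand instead of being fitted with the Levenberg-Marquardt algorithm, the Gaussian uses the squared distance divided by `tau`, and all function names and sample coordinates are hypothetical.

```python
import numpy as np


def f_gi(x):
    """Stand-in coarse proximity: unit circle with f_im(x) = |x|^2 - 1."""
    f_im = float(np.dot(x, x)) - 1.0
    return float(np.exp(-f_im ** 2))


def fit_gaussians(validation, zeta):
    """validation: list of (vector, is_target) pairs.
    Installs a Gaussian only for the two critical cases described in the text."""
    centers, weights = [], []
    for x, is_target in validation:
        x = np.asarray(x, dtype=float)
        eta = f_gi(x)                              # proximity to the coarse orbit
        if is_target and eta < zeta:               # target view too far away
            centers.append(x)
            weights.append(1.0 - eta)
        elif not is_target and eta > zeta:         # counter view too close
            centers.append(x)
            weights.append(-eta)
    return centers, weights


def f_rc(x, centers, weights, tau):
    """Coarse proximity plus the weighted spherical Gaussians."""
    x = np.asarray(x, dtype=float)
    value = f_gi(x)
    for c, w in zip(centers, weights):
        value += w * float(np.exp(-np.sum((x - c) ** 2) / tau))
    return value


validation = [((1.05, 0.0), False),   # counter view close to the circle
              ((0.0, 2.0), True)]     # target view far from the circle
centers, weights = fit_gaussians(validation, zeta=0.5)
counter_value = f_rc((1.05, 0.0), centers, weights, tau=0.05)
target_value = f_rc((0.0, 2.0), centers, weights, tau=0.05)
```

With well separated validation views the counter view is pushed to a value near 0 and the additional target view is lifted to a value near 1, while points on the circle far from both centers keep a value near 1.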

For taking images under controlled viewing conditions, i.e. viewing angle or
distance, the camera can be mounted on a robot arm and moved into any desired
pose. The simplest strategy for selecting seed views is a regular discretization
of the space of possible viewing poses. The selection of validation views may be
done in a similar way, but considering pose offsets for views of the target objects
and also taking images from counter situations. Several improvements for these
strategies are conceivable, and the most important one concerns the treatment of
validation views. For example, the fine approximation phase may be performed
iteratively by checking for every validation view the necessity for modifying
the emerging recognition function. Actually, we must install a new Gaussian
only in the case of facing a recognition error according to the decision criterion
given in equation (14). Every validation view is considered as a candidate, and
only if the view is critical does the refinement take place dynamically. This
sophisticated strategy reduces the description length of the recognition function.
Other interesting work has been reported belonging to the paradigm of active 
learning in which random or systematic sampling of the input domain is replaced 
by a selective sampling [IJ. This paper doesn’t focus on this aspect. 
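
The iterative variant of the fine approximation, in which a Gaussian is installed only after a recognition error, might look like the following sketch. It is an assumption-laden illustration and not the authors' procedure: the class `RecognitionFunction`, the functions `classify` and `refine`, the circle-shaped coarse functions, and the choice to compute the weights from the current value of the emerging recognition function are all hypothetical.

```python
import numpy as np


class RecognitionFunction:
    """Coarse proximity function plus dynamically installed weighted Gaussians."""

    def __init__(self, base, tau):
        self.base = base          # coarse proximity function with values in [0, 1]
        self.tau = tau            # extent of the Gaussians (fixed by hand here)
        self.centers = []
        self.weights = []

    def __call__(self, x):
        value = self.base(x)
        for c, w in zip(self.centers, self.weights):
            value += w * float(np.exp(-np.sum((x - c) ** 2) / self.tau))
        return value

    def install(self, center, weight):
        self.centers.append(np.asarray(center, dtype=float))
        self.weights.append(weight)


def classify(functions, x):
    """Decision by the maximum response over all recognition functions."""
    return int(np.argmax([f(x) for f in functions]))


def refine(functions, validation):
    """Install Gaussians only for validation views that are misclassified."""
    installed = 0
    for x, label in validation:
        x = np.asarray(x, dtype=float)
        winner = classify(functions, x)
        if winner == label:
            continue                                   # view is not critical
        functions[label].install(x, 1.0 - functions[label](x))   # lift to 1
        functions[winner].install(x, -functions[winner](x))      # push to 0
        installed += 2
    return installed


def circle(center, radius):
    """Hypothetical coarse proximity function: circle instead of hyper-ellipsoid."""
    center = np.asarray(center, dtype=float)
    return lambda x: float(np.exp(-(np.sum((x - center) ** 2) / radius ** 2 - 1.0) ** 2))


functions = [RecognitionFunction(circle((0.0, 0.0), 1.0), tau=0.05),
             RecognitionFunction(circle((3.0, 0.0), 1.0), tau=0.05)]
view = np.array([1.6, 0.0])             # belongs to object 0, but closer to orbit 1
before = classify(functions, view)
count = refine(functions, [(view, 0)])
after = classify(functions, view)
```

Views that are already classified correctly add nothing to the description, so the number of Gaussians grows only with the number of critical validation views.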



7 Experiments with the Coarse-to-Fine Strategy of 
Learning 

The primary purpose is to obtain a recognition function for a target object of
three-dimensional shape, which can be rotated arbitrarily and can have different
distances from the camera. Accordingly, both the gray value structure and
the size of the target pattern vary significantly. Three objects are considered
which look similar to each other, i.e. integrated circuit, chip carrier, bridge
rectifier. Figure 6 shows a subset of three images from each object.
Different sets of seed and validation ensembles will be used for learning. As an
example, we only present the recognition results for the integrated circuit. A set
of 180 testing images is taken which differs from the training images in offset
values of the rotation angle and in the size of the patterns, as shown by three
overlays in Figure 7.

We determine recognition results, first by using a 1-nearest-neighbor approach,
second by applying a coarse manifold approximation, and third compare
them with those recognition results obtained from our coarse-to-fine strategy of
learning. The approaches have in common that in a first step a testing pattern
is projected into three canonical frames (CFs), which are the eigenspaces of
the three objects, respectively. The second step of the approaches is the
characteristic one. In the first approach ($CF_{1NN}$, 1NN for 1-Nearest-Neighbor) the
recognition of a testing view is based on the proximity to all seed views from all
three objects, and the relevant seed view determines the relevant object. In the
second approach ($CF_{ELL}$, ELL for proximity to ELLipsoids) the recognition of
a testing view is based on the proximity to the three hyper-ellipsoids defined
in the canonical frames, respectively. In the third approach ($CF_{EGN}$, EGN for
proximity to Ellipsoids extended with Gaussian Networks), the recognition of a
testing view is based on a refinement of the coarse approximation of the pattern
manifold by considering counter views with a network of GBFs, i.e. our favorite
coarse-to-fine approach of learning. For validation views we simply take the seed
views of the other objects, respectively. The decision for recognition is based on
equation (14).

Fig. 6. Three seed images from three objects, respectively.

Fig. 7. Overlay between a training image and three testing images.

We make experiments with different numbers of seed views and thus obtain
canonical frames of different dimensionality. As examples, 6, 12, 20, and 30 seed
views are used, which give dimensions 5, 11, 19, and 29 of the canonical frames
(denoted by $NS_1$, $NS_2$, $NS_3$, $NS_4$, respectively, NS for Number of Seed views).
Table 1 shows the results, i.e. the numbers of recognition errors, when applying
the three approaches and taking four different dimensions into account.
Approach $CF_{ELL}$ clearly surpasses $CF_{1NN}$, and our favorite approach $CF_{EGN}$ is
clearly better than the other two. The course of recognition errors of $CF_{EGN}$, by
increasing the dimension, shows the classical conflict between over-generalization
and over-fitting. That is, the number of errors decreases significantly when
increasing the dimension from $NS_1$ to $NS_2$, and remains constant or even increases
when increasing the dimension further from $NS_2$ to $NS_3$ or to $NS_4$. Therefore,
it is convenient to take the dimension $NS_2$ for the recognition function as a
compromise, which is both reliable and efficient. Qualitatively, all our experiments
showed similar results (they are not presented in this paper).



Table 1. Recognition errors for a testing set which consists of 180 elements. The 
approaches of object recognition have been trained alternatively with 6, 12, 20, or 30 
seed vectors, for the CFegn approach we take into account additionally 12, 24, 40, or 
60 validation vectors. 



Errors        $NS_1$   $NS_2$   $NS_3$   $NS_4$

$CF_{1NN}$      86       59       50       49

$CF_{ELL}$      32        3       14       18

$CF_{EGN}$      24        1        2        3



According to the last row in Table 1, a slight increase of the number of
recognition errors occurs when raising the number of seed views beyond a certain
threshold, e.g. 20 or 30 seed views in our experiments. Therefore, the advantage
of considering generalization biases (mentioned in Section 3) is weakened to a
certain extent. This undesired finding is due to the fact that each new seed view
will lead to an additional dimension and thus will cause a redefinition of the
canonical frame. The generalization induced by the higher dimensional
hyper-ellipsoid may become more and more hazardous.

A more sophisticated approach is needed which increases the dimension on
the basis of several (instead of just one) additional seed views. However, the
purpose of this work has been to demonstrate the advantageous role of
generalization biases in neural network learning, which is obtained by combining
Gaussian basis function networks with hyper-ellipsoidal interpolations.

8 Discussion

The work presented a learning paradigm for appearance-based recognition
functions. Principal component analysis (PCA) and Gaussian basis function networks
(GBFN) are combined for dealing with the trade-off between efficiency,
invariance, and discriminability. PCA is used for incorporating generalization biases,
which is done by a hyper-ellipsoid interpolation of seed views. GBFN is
responsible for making the recognition function discriminative, which is reached by a
dynamic installation of weighted Gaussians for critical validation views. The
combined set of training views is sparse, which makes the learning procedure
efficient and also results in a minimal description length. Apart from that, the
discriminability of the learned recognition functions is impressive.

The presented learning paradigm is embedded in our methodology of devel- 
oping Robot Vision systems. It works in combination with Active Vision strate- 
gies, i.e. we must exploit the agility of a camera in order to constrain the pos- 
sible camera-object relations and thus reduce the complexity of the manifold. 
Specifically controlled camera movements enable the incorporation of further 
constraints, e.g. space-time correlations, log-polar invariants, which make the 
manifold construction more sophisticated. 

We may extend the iterative learning procedure such that also canonical 
frames are constructed dynamically. This would be in addition to the dynamic 
installation of Gaussians. The purpose is to find a compromise between the 
dimension of canonical frames and the number of Gaussians, i.e. keep the product 
of both numbers as low as possible to reach minimum description length. The 
mentioned concept is a focus of future work. 



Appendix 

Let a hyper-ellipsoid be defined according to Section 0 

Theorem. All seed vectors $X_i^s$, $i \in \{1, \ldots, I\}$, are located on the
hyper-ellipsoid.

Proof. There are several $(I-1)$-dimensional hyper-ellipsoids which interpolate
the set $\Omega$ of vectors. PCA determines the principal axes
$e_1, \ldots, e_{I-1}$ of a specific hyper-ellipsoid which is subject to
maximization of projected variances along candidate axes. For the vectors in
$\Omega$ we determine the set $\hat{\Omega}$ of KLE-transformed vectors
$\hat{x}_i := (\hat{x}_{i,1}, \ldots, \hat{x}_{i,I-1})^T$,
$i \in \{1, \ldots, I\}$. All vectors in $\hat{\Omega}$ are located on a normal
hyper-ellipsoid with constant Mahalanobis distance $h$ from the origin. With the
given definition for the half-lengths we can show that $h$ is equal to 1, which
will prove the theorem.

For the vectors in $\hat{\Omega}$ the variance $v_l$ along axis $e_l$,
$l \in \{1, \ldots, I-1\}$, is given by
$v_l := \frac{1}{I} \cdot (\hat{x}_{1,l}^2 + \cdots + \hat{x}_{I,l}^2)$. The
variances $v_l$ are equal to the eigenvalues $\lambda_l$. For each vector
$\hat{x}_i$ we have the equation

$$\frac{\hat{x}_{i,1}^2}{\kappa_1^2} + \cdots + \frac{\hat{x}_{i,I-1}^2}{\kappa_{I-1}^2} = h,$$

because the vectors are located on a normal hyper-ellipsoid. Replacing
$\kappa_l^2$ in the equation by the expression
$\frac{I-1}{I} \cdot (\hat{x}_{1,l}^2 + \cdots + \hat{x}_{I,l}^2)$ yields the
following equation:

$$\frac{I}{I-1} \cdot \left( \frac{\hat{x}_{i,1}^2}{\hat{x}_{1,1}^2 + \cdots + \hat{x}_{I,1}^2} + \cdots + \frac{\hat{x}_{i,I-1}^2}{\hat{x}_{1,I-1}^2 + \cdots + \hat{x}_{I,I-1}^2} \right) = h.$$

Summing up all these equations for $i \in \{1, \ldots, I\}$ yields the equation
$\frac{I}{I-1} \cdot (I-1) = I \cdot h$, which results in $h = 1$.

q.e.d.



Abstract. A scene change detection method is presented in this paper,
which analyzes both auditory and visual information sources and accounts
for their inter-relations and coincidence to semantically identify
video scenes. Audio analysis focuses on the segmentation of the audio
source into three types of semantic primitives, i.e. silence, speech and
music. Further processing of speech segments aims at locating speaker
change instants. Video analysis attempts to segment the video source
into shots, without the segmentation being affected by camera pans,
zoom-ins/outs or significantly high object motion. Results from single
source segmentation are in some cases suboptimal. Audio-visual interaction
either enhances single source findings or extracts high level semantic
information. The aim of this paper is to identify semantically
meaningful video scenes by exploiting the temporal correlations of both
sources, based on the observation that semantic changes are characterized
by significant changes in both information sources. Experimentation has
been carried out on a real TV serial sequence composed of many different
scenes with plenty of commercials appearing in-between. The results are
promising.



1 Introduction 

Content-based video parsing, indexing, search, browsing and retrieval have
recently become active research topics due to the enormous amount of
unstructured video data available nowadays, the spread of its use as a data
source in many applications, and the increasing difficulty of manipulating it
and retrieving the material of interest. The need for content-based indexing and
coding has
been foreseen by ISO/MPEG that has introduced two new standards: MPEG-4 
and MPEG-7 for coding and indexing, respectively [Q. 

R. Klette et al. (Eds.): Multi-Image Analysis, LNCS 2032, pp. 214-^^^ 2001.

(c) Springer-Verlag Berlin Heidelberg 2001

In order to efficiently index video data, one must first semantically identify
video scenes. The term scene refers to one or more successive shots combined
together because they exhibit the same semantically meaningful concept, e.g.
a scene that addresses the same topic although many shots may be involved.
The term shot denotes a sequence of successive frames that corresponds to a
single camera start and end session. Scene characterization should be content-
and search-dependent. The task of semantic scene identification is rather
tedious and no automatic approaches have been reported to date. Usually,
low-level processing of the visual data is initially undertaken. Shot boundary
detection, i.e.,
temporal segmentation, is performed and analysis of detected shots follows 03 
2] . Results are enhanced and higher level semantic information can be extracted 
when other information sources are analyzed, such as aural or textual ones |3 
lt)ltlS| . It is evident that semantic characterization can only be achieved with 
annotator intervention or by imposing user-defined interaction rules and domain 
knowledge. 

A scene change detection method is presented in this paper which analyzes 
both auditory and visual sources and accounts for their inter-relations and syn- 
ergy to semantically identify video scenes. The audio source is analyzed and 
segmented into three types of semantic primitives: silence, speech and music. 
Further analysis on speech parts leads to the determination of speaker change 
instants, without any knowledge on the number or the identity of speakers and 
without any need for a training process. The video source is processed by a 
combination of two shot boundary detection methods based on color frame and 
color vector histogram differences in order to efficiently detect shot boundaries 
even under various edit effects and camera movement. Combination of the re- 
sults extracted from single information sources leads to grouping a number of 
successive shots into a scene according to whether they are in-between two suc- 
cessive speaker change instants or the same music segment accompanies them, 
or there are long duration silence segments before and after them. If speaker
alternation is further examined, such scenes can also be partially identified as
commercials, events or dialogue scenes. In Sect. 2 the tools for low-level audio
analysis and segmentation are summarized, while in Sect. 3 video segmentation
into shots is reported. In Sect. 4 scene identification by combining both aural
and visual information based on interaction rules is presented. Simulation
results on a TV serial sequence of around 15 min duration containing many
commercials are reported in Sect. 5. Finally, conclusions are drawn in Sect. 6.

2 Audio Analysis 



Audio analysis aims at segmenting the audio source into three types of semantic 
primitives: silence, speech and music. Further processing on speech segments 
attempts to locate speaker change instants. Segmentation and speaker change 
identification are achieved by low-level processing methods. In the sequel, the 
term audio frame refers to the shortest in duration audio part used in short-time 
audio analysis, whereas the term segment refers to a group of a variable number 
of successive frames pre-classified to one of the three predefined audio types. 

Initially, silence detection is performed to identify silence periods and dis- 
card them from subsequent analysis. Silence frames are audio frames of only 
background noise with a relatively low energy level and high zero crossing rate 
(ZCR) compared to other audio signal types. In order to distinguish silence from 
other audio signal types, the average magnitude Mt and zero crossing rate Zt 
functions of an M-sample audio frame $x_t(n)$, $n = 0, \ldots, M-1$, are
exploited |n|:

$$M_t = \sum_{k=0}^{M-1} |x_t(k)| \qquad (1)$$

$$Z_t = \frac{1}{2} \sum_{k=1}^{M-1} |\mathrm{sgn}(x_t(k)) - \mathrm{sgn}(x_t(k-1))| \qquad (2)$$

t = 0, ..., N - 1, where N is the total number of audio frames. Non-overlapping
audio frames of 10 msec duration are employed. A convenient approach to robust
speech-silence discrimination is endpoint detection, which determines the
beginning and end of words, phrases or sentences so that subsequent processing
is applied only to these segments. Average magnitude and ZCR thresholds
are chosen relative to the background noise characteristics of an a priori known
audio interval, its average magnitude and ZCR functions being $M_{t,n}$ and
$Z_{t,n}$, respectively. The average magnitude thresholds used by endpoint
detection are set equal to:

$$M_{thr,up} = E[M_t], \qquad M_{thr,low} = \max(M_{t,n}) \qquad (3)$$

The ZCR threshold is set equal to $Z_{thr} = \max(Z_{t,n})$. Such a threshold
selection proves to be robust and endpoint detection is satisfactorily
performed. Boundaries of words, phrases or entire sentences are well estimated,
a useful outcome that is subsequently exploited for audio segmentation and
characterization.
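
The frame features and thresholds described above can be illustrated with a
minimal Python sketch. It is not the authors' implementation: the function
names are hypothetical, the average magnitude is taken as the plain sum of
absolute sample values, the zero crossing rate uses the common
half-sum-of-sign-changes form with sgn(0) = +1, and the endpoint detection
procedure itself (which the text does not spell out) is left out.

```python
from statistics import mean


def average_magnitude(frame):
    """M_t for one frame, assumed here to be the sum of |x_t(k)|."""
    return sum(abs(x) for x in frame)


def sign(x):
    """Sign function with the assumed convention sgn(0) = +1."""
    return 1 if x >= 0 else -1


def zero_crossing_rate(frame):
    """Z_t: half the summed absolute sign changes between neighbouring samples."""
    return 0.5 * sum(abs(sign(b) - sign(a)) for a, b in zip(frame, frame[1:]))


def endpoint_thresholds(all_frames, noise_frames):
    """Thresholds derived from all frames and from a known noise-only interval."""
    m_all = [average_magnitude(f) for f in all_frames]
    m_noise = [average_magnitude(f) for f in noise_frames]
    z_noise = [zero_crossing_rate(f) for f in noise_frames]
    return {
        "m_up": mean(m_all),    # upper magnitude threshold: E[M_t]
        "m_low": max(m_noise),  # lower magnitude threshold: max over noise frames
        "z": max(z_noise),      # ZCR threshold: max over noise frames
    }
```

Calling `endpoint_thresholds` with the frames of the whole track and the frames
of a noise-only interval returns the three thresholds; how they are then applied
to mark word and sentence boundaries depends on the endpoint detector used.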

Music detection is further performed to discriminate speech from music. Music
segments are audio parts having significant high frequency content, high ZCR,
different periodicity compared to speech segments (voiced parts), and usually
long duration. The latter is attributed to the fact that music does not usually
exhibit silence periods between different successive parts, leading to a long
audio segment. Thus, in order to distinguish speech from music, four criteria
are used: an energy measure, the ZCR, a correlation measure in the frequency
domain that attempts to detect periodicity, and, finally, segment duration.
Energy, $M_t$, and ZCR, $Z_t$, values are evaluated by (1) and (2),
respectively, on audio frames of 10 msec duration located inside the current
segment $S_i$, $i = 1, \ldots, N_S$, where $N_S$ is the total number of detected
segments other than silence ones. Subsequently, segment-based mean values and
variances of $M_t$ and $Z_t$ are estimated, i.e.:

$$\mu_{M_{S_i}} = E[M_t \mid t \in S_i], \qquad \mu_{Z_{S_i}} = E[Z_t \mid t \in S_i]$$

$$\sigma^2_{M_{S_i}} = E[(M_t - \mu_{M_{S_i}})^2], \qquad \sigma^2_{Z_{S_i}} = E[(Z_t - \mu_{Z_{S_i}})^2] \qquad (4)$$

Their quotient is considered more discriminative for recognizing music from
speech:

$$Q_{M_{S_i}} = \frac{\mu_{M_{S_i}}}{\sigma^2_{M_{S_i}}} \qquad (5)$$

$$Q_{Z_{S_i}} = \frac{\mu_{Z_{S_i}}}{\sigma^2_{Z_{S_i}}} \qquad (6)$$

This is because both long-term (segment-based) energy and ZCR mean values are
higher for music than for speech. Besides, due to the existence of voiced and
unvoiced parts in speech, long-term variance values of speech segments are
expected to be higher than those of music segments. In order to take advantage
of the long duration periodicity of music, a frequency-based correlation metric
$C_t$ is defined between the magnitude spectra of successive non-overlapping
audio frames of 30 msec located in segment $S_i$, $i = 1, \ldots, N_S$:

$$C_t = \sum_{k=0}^{M-1} |\mathcal{F}(x_t)(k)| \cdot |\mathcal{F}(x_{t-1})(k)| \qquad (7)$$

where $\mathcal{F}$ denotes the Fourier transform operator. If the signal is
periodic, $x_t$ and $x_{t-1}$ will have almost identical spectra, thus leading
to a high correlation value. Correlation is performed in frequency because the
magnitude of the Fourier transform remains unaffected by time shifts. In the
case of music, $C_t$ is expected to attain constantly large values within $S_i$.
On the other hand, speech, characterized by both periodic (voiced) and aperiodic
(unvoiced) parts, will have alternating high and low values of $C_t$ within
$S_i$. Thus, segment-based mean values of $C_t$,
$\mu_{C_{S_i}} = E[C_t \mid t \in S_i]$, are considered to be adequately
discriminative for detecting music; $\mu_{C_{S_i}}$ is expected to be higher for
music segments than for speech ones. Finally, the segment duration $d_{S_i}$,
$i = 1, \ldots, N_S$, is also employed. Each of the metrics $Q_{M_{S_i}}$,
$Q_{Z_{S_i}}$, $\mu_{C_{S_i}}$ and $d_{S_i}$ is individually a good
discriminator of music. Global thresholding with thresholds:

$$T_M = E[Q_{M_{S_i}}] + [\ldots] \qquad (8)$$

$$T_Z = [\ldots]\, E[Q_{Z_{S_i}}] \qquad (9)$$

$$T_C = 2\, E[\mu_{C_{S_i}}] \qquad (10)$$

$$T_d = 5\ \text{sec} \qquad (11)$$

respectively, leads to individual but suboptimal detection of music segments.
Combination of these results in order to enhance music detection is based on the
validity of the expression:

$$((Q_{M_{S_i}} > T_M) \text{ OR } (d_{S_i} > T_d)) \text{ OR } ((Q_{Z_{S_i}} > T_Z) \text{ AND } (\mu_{C_{S_i}} > T_C)) \qquad (12)$$

If (12) is true for a segment $S_i$, then this segment is considered to be a
music segment. Otherwise, it is declared a speech segment. It is noted that
audio segments that contain both speech and music are expected to be classified
according to the dominant type.
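
The music/speech decision can be sketched in Python as follows. This is an
illustration under stated assumptions rather than the authors' code: the
function names are hypothetical, the spectral correlation is taken as the sum of
products of magnitude spectra of two successive frames (computed with a naive
DFT), the quotient is taken as mean divided by population variance, and the
thresholds are plain parameters because their exact definitions are only
partly legible in the source.

```python
import cmath
from statistics import mean, pvariance


def magnitude_spectrum(frame):
    """Magnitudes of the discrete Fourier transform (naive O(M^2) DFT)."""
    m = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * n / m)
                for n, x in enumerate(frame)))
        for k in range(m)
    ]


def spectral_correlation(prev_frame, frame):
    """Assumed form of C_t: sum over bins of the product of magnitude spectra."""
    return sum(a * b for a, b in
               zip(magnitude_spectrum(prev_frame), magnitude_spectrum(frame)))


def mean_variance_quotient(values):
    """Quotient of segment mean and variance (undefined for constant input)."""
    return mean(values) / pvariance(values)


def is_music(q_m, q_z, mu_c, duration, t_m, t_z, t_c, t_d=5.0):
    """Decision rule combining the four criteria; True means music."""
    return (q_m > t_m or duration > t_d) or (q_z > t_z and mu_c > t_c)
```

Because only magnitudes enter `spectral_correlation`, a frame and a circularly
shifted copy of it give the same value as the frame correlated with itself,
which is the shift insensitivity the text relies on.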

Speech segments are further analyzed in an attempt to locate speaker change 
instants. In order to do that, low-level feature vectors are firstly extracted from 
voiced pre-classified frames only 0, located inside a speech segment. Voiced- 
unvoiced discrimination is based on the fact that unvoiced speech sounds exhibit 
significant high frequency content in contrast to voiced ones. Thus, the energy
distribution of the frame signal is evaluated in the lower and upper frequency
bands (the boundary is set at 2 kHz with a sampling rate of 11 kHz). High to
low energy ratio values greater than 0.25 imply unvoiced sounds, which are not
processed further. For audio feature extraction in voiced frames, the speech
signal is initially pre-emphasized by an FIR filter with transfer function
$H(z) = 1 - 0.95 z^{-1}$. Speech frames of 30 msec duration are used, with an
overlap of 20 msec between successive frames. Each frame is windowed by a
Hamming window of size M. Finally, the mel-frequency cepstrum coefficients
(MFCC), $c = \{c_k, k \in [1, p]\}$,
are extracted per audio frame UDI . p is the dimension of the audio feature vector. 
The aim now is to locate speaker change instants, which are used later on for
enhancing scene boundary detection. In order to do that, feature vectors of K
successive speech segments $S_{K_0}, \ldots, S_{K_0+K}$ are first grouped
together to form sequences of
feature vectors of the form m 

$$X = \{\underbrace{c_1, \ldots, c_{L_{S_{K_0}}}}_{S_{K_0}}, \underbrace{c_1, \ldots, c_{L_{S_{K_0+1}}}}_{S_{K_0+1}}, \ldots, \underbrace{c_1, \ldots, c_{L_{S_{K_0+K}}}}_{S_{K_0+K}}\} \qquad (13)$$

Grouping is performed on the basis of the total duration of the grouped speech
segments. This is expected to be equal to or greater than 2 sec, when assuming
that only one speaker is talking. Consecutive sequences X and Y of feature
vectors of the form (13), with Y composed of K' speech segments and defined by:

$$Y = \{\underbrace{c_1, \ldots, c_{L_{S_{K_0+K+1}}}}_{S_{K_0+K+1}}, \ldots, \underbrace{c_1, \ldots, c_{L_{S_{K_0+K+K'}}}}_{S_{K_0+K+K'}}\} \qquad (14)$$

are considered, having a common boundary at the end of $S_{K_0+K}$ and the
beginning of $S_{K_0+K+1}$. The similarity of these two sequences is
investigated by first evaluating their mean vectors, $\mu_X$, $\mu_Y$, and their
covariance matrices, $\Sigma_X$, $\Sigma_Y$, and then defining the following
distance metric:

$$D_t(X, Y) = (\mu_X - \mu_Y)\, \Sigma_X^{-1}\, (\mu_X - \mu_Y)^T + (\mu_Y - \mu_X)\, \Sigma_Y^{-1}\, (\mu_Y - \mu_X)^T \qquad (15)$$



$D_t$ is evaluated for the next pairs of sequences X, Y, until all speech
segments have been used. The immediate next pair is constructed by shifting the
X sequence by one segment, i.e. starting at $S_{K_0+1}$, and re-evaluating the
numbers K and K', so that the constraint on total duration is met. This approach
is based on the observation that a speaker can be sufficiently modeled by the
covariance matrix of feature vectors extracted from the speaker's utterances.
Furthermore, the covariance matrices evaluated on feature vectors coming from
utterances of the same speaker are expected to be nearly identical. Adaptive
thresholding follows to locate speaker change instants. Local mean values on a
1D temporal window W of size $N_W$ are obtained, without considering the value
of $D_t$ at the current location $t_0$:

$$D_m = E[D_t \mid t \in W, t \neq t_0] \qquad (16)$$

$D_{t_0}$ is examined to specify whether it is the maximum value of those inside
the temporal window (possibility of a speaker change instant at $t_0$). If this
is the case and $D_{t_0}/D_m > \epsilon$, where $\epsilon$ is a constant
controlling the strictness of thresholding, a speaker change instant is detected
at $t_0$. Speaker change instants are a clue for shot or even scene breaks. The
method may be further investigated to identify speaker alternation and dialogue
shots/scenes.
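
The adaptive thresholding step can be sketched in Python as below. The function
name, the centred placement of the window and the default values of `window` and
`eps` are assumptions made for illustration; the text only states that a local
mean excluding the current value is compared against the current value.

```python
def adaptive_peaks(d, window=5, eps=2.0):
    """Return indices t0 where d[t0] is the maximum of its local window and
    exceeds eps times the mean of the other values in that window."""
    half = window // 2
    peaks = []
    for t0 in range(len(d)):
        lo, hi = max(0, t0 - half), min(len(d), t0 + half + 1)
        others = [d[t] for t in range(lo, hi) if t != t0]
        if not others:
            continue
        local_mean = sum(others) / len(others)  # D_m, current value excluded
        # Written as a product so that a zero local mean needs no division.
        if d[t0] >= max(others) and d[t0] > eps * local_mean:
            peaks.append(t0)
    return peaks
```

The same routine applies to any one-dimensional distance sequence, so it can be
reused unchanged for the shot boundary metrics of the video analysis.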

3 Video Analysis 

Video analysis involves the temporal segmentation of the video source into shots. 
Shot boundary detection is performed by combining distance metrics produced 
by two different shot boundary detection methods. Such a dual mode approach 
is expected to lead to enhanced shot boundary detection results even under 
significant camera or object movement or camera effects, thus overcoming the 
drawbacks of the single modalities in some cases. 

The first method estimates color frame differences between successive frames. 
Color differences, $FD_t$, are defined by:

where $I(x;t) = [I_r(x;t)\ I_g(x;t)\ I_b(x;t)]^T$ represents the vector-valued
pixel intensity function composed of the three color components $I_r(x;t)$,
$I_g(x;t)$ and $I_b(x;t)$. By $\|\cdot\|_1$ the $L_1$-vector norm is denoted;
x = (x, y) spans the spatial dimensions of the sequence (each frame is of size
$N_x \times N_y$), whereas t spans its temporal one. Frame differencing is
computationally intensive, but limitations on the processing time are seldom
imposed when the task is performed off-line. In order to detect possible shot
breaks, the adaptive thresholding approach used for detecting speaker change
instants in Sect. 2 is adopted. Such window-based thresholding adapts to local
content and proves flexible and efficient under gradual camera movements,
significantly abrupt object or camera movements, and simple edit effects such as
zoom-ins and zoom-outs (no false positives, over-segmentation is avoided).
Abrupt changes are directly recognised.

The second method evaluates color vector histograms of successive frames 
and computes their bin-wise differences. Summation over all bins leads to the 
metric that is used for shot break detection. Histogram-based methods are robust 
to camera as well as to object motion. Furthermore, color histograms are invari- 
ant under translation and rotation about the view axis and change only slowly 
under change of view angle, change in scale, and occlusion. However, histograms 
are very sensitive to shot illumination changes. To overcome this problem and 
make the method more robust, our approach operates in the HLS color space and 
ignores luminance information. Thus, instead of using HLS vector histograms 
(3- valued vector histograms), the method uses HS vector ones (2- valued vector 
histograms). Luminance conveys information only about illumination intensity 
changes, while all color information is found in the hue and saturation domain. 
Usually, hue contains most of the color information. Saturation is examined and
used to determine which regions of the image are achromatic. In order to
evaluate HS vector histograms, the hue range [0°, 360°] is divided into 32
equally spaced bins $h_i$, $i = 1, \ldots, 32$, and the saturation range [0, 1]
into 8 equally spaced bins $s_j$, $j = 1, \ldots, 8$. Vector bins are
constructed by considering all possible pairs of the scalar hue and saturation
bins, leading thus to a total number of 256 vector bins $hs_k = (h_i, s_j)$,
$k = 1, \ldots, 256$. Such an approach translates to a uniform quantization of
each frame into 256 colors. The color vector bin-wise histogram $H(hs_k; t)$
for frame t is computed by counting all pixels having hue and saturation values
lying inside the considered vector bin $hs_k$ and dividing by the total number
of frame pixels. The histogram differences, $HD_t$, are then computed for every
frame pair $(t-1, t)$ by:

$$HD_t = \sum_{k=1}^{256} \| H(hs_k; t) - H(hs_k; t-1) \|_1 \qquad (18)$$

where k is the vector bin index and $\|\cdot\|_1$ denotes the $L_1$-vector norm.
Each frame is of size $N_x \times N_y$ and t spans the temporal dimension of the
sequence. Histogram differencing is computationally intensive. In order to
detect possible shot breaks, our approach first examines the validity of the
expression:

$$2 \cdot E[HD_t] < \frac{\max(HD_t) - \min(HD_t)}{2} \qquad (19)$$

If it is true, then the sequence consists of a single shot without any shot
breaks. Otherwise, the adaptive thresholding technique introduced for
detecting speaker change instants is also employed here, leading to efficient
shot break detection. Abrupt changes are directly recognized, but the method is
also satisfactorily efficient with smooth changes between different shots.
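
A minimal Python sketch of the HS vector histogram and its frame-to-frame
difference follows. It assumes frames are given as lists of RGB triples with
components in [0, 1], uses the standard library `colorsys.rgb_to_hls`
conversion as a stand-in for whatever HLS conversion the authors used, and takes
the difference as the plain sum of absolute bin-wise differences; the function
names are hypothetical.

```python
import colorsys

HUE_BINS, SAT_BINS = 32, 8  # 32 x 8 = 256 vector bins


def hs_histogram(pixels):
    """Normalised 256-bin hue/saturation histogram; luminance is ignored."""
    hist = [0.0] * (HUE_BINS * SAT_BINS)
    weight = 1.0 / len(pixels)
    for r, g, b in pixels:
        h, _lum, s = colorsys.rgb_to_hls(r, g, b)  # h and s lie in [0, 1]
        i = min(int(h * HUE_BINS), HUE_BINS - 1)
        j = min(int(s * SAT_BINS), SAT_BINS - 1)
        hist[i * SAT_BINS + j] += weight
    return hist


def histogram_difference(hist_a, hist_b):
    """Sum over all vector bins of the absolute bin-wise difference."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))
```

A frame of pure red and a darker frame of the same hue and saturation fall into
the same bin, so their difference is zero, whereas a red frame followed by a
green one gives the maximum difference of 2; this is the illumination
insensitivity that motivates dropping the luminance channel.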

However, both the frame difference and the color vector histogram based methods
exhibit limited performance when employed separately, compared to their
combination. Thus, fusion of single case outcomes is proposed. Specifically, the
difference metrics (17) and (18) are multiplied to lead to an overall metric:

$$OD_t = FD_t \cdot HD_t \qquad (20)$$

that is adaptively thresholded later on for shot cut detection. Despite its
simplicity, such multiplication amplifies peaks of the single case metrics,
possibly corresponding to shot cuts, while it significantly lowers the remaining
values. The same adaptive thresholding method is employed here as well, leading
to enhanced detection compared to the single case approaches. Strong object
motion or significant camera movement, edit effects like zoom-ins/outs, and in
some cases dissolves (dominant in commercials) are dealt with. Over-segmentation
is avoided in almost all cases.
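
The effect of the multiplication can be seen in a short Python sketch. The
frame difference is assumed here to be the L1 distance between corresponding
RGB pixels averaged over the frame, since the normalisation of the color frame
difference is not legible in the source; the function names are hypothetical.

```python
def frame_difference(frame_a, frame_b):
    """Mean L1 distance between corresponding RGB pixels of two frames."""
    total = sum(abs(ca - cb)
                for pa, pb in zip(frame_a, frame_b)
                for ca, cb in zip(pa, pb))
    return total / len(frame_a)


def combined_metric(fd, hd):
    """Overall metric: element-wise product of the two difference sequences."""
    return [f * h for f, h in zip(fd, hd)]
```

For example, with frame differences `[0.1, 0.9, 0.1]` and histogram differences
`[0.2, 0.8, 0.2]`, the peak stands out by a factor of 9 and 4 respectively in
the single metrics, but by a factor of 36 in their product, which makes the
subsequent adaptive thresholding easier.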



4 Audio-Visual Interaction: Scene Boundary Detection 
and Partial Scene Identification 

Our aim is to group successive shots together into semantically meaningful
scenes based on both visual and aural clues and using interaction rules.
Multimodal interaction can serve two purposes: (a) enhance the “content
findings” of one source by using similar content knowledge extracted from the
other source(s), (b) offer a more detailed content description of the same video
instances by combining the content descriptors (semantic primitives) of all data
sources based on interaction rules and coincidence concepts. Temporal
coincidence, due to the temporal nature of video data, is a very convenient tool
for multimodal interaction.

The combination of the results extracted from the single information sources
leads to the grouping of a number of successive shots into a scene according to
a number of imposed constraints and interaction rules. It is noted here that,
given the results of the presented aural and visual segmentation algorithms,
only scene boundaries are determined, while scene characterization, e.g. as a
dialogue scene, can only be partially performed in some cases. Further analysis
of those results and additional rules may lead to overall scene
characterization. Shot grouping into scenes and scene boundary determination is
performed in our case when the same audio type (music or speaker) characterizes
successive shots. Partial scene identification is done according to the
following concepts:

— commercials are identified by their background music and the many, short 
in duration, shots that they have. 

— dialogue scenes can be identified by the high speaker alternation rate exhib- 
ited inside the scene. 

5 Simulation Results 

Experimentation has been carried out on several real TV sequences containing
many commercials and many shots, characterized by significant edit effects like
zoom-ins/outs and dissolves, abrupt camera movement and significant motion
inside single shots. We present here a representative case of a video sequence
of approximately 12 min duration that has been digitized with a frame rate of
25 fps at QCIF resolution. The audio track is a mixture of silence, speech,
music and, in some cases, miscellaneous sounds. The audio signal has been
sampled at 11 kHz and each sample is a 16-bit signed integer. In the sequel, the
performance of the various aural and visual analysis tools presented in Sects. 2
and 3 is investigated first. Then, scene change detection is examined and
partial scene characterization is attempted.

In order to evaluate the performance of the audio segmentation techniques, 
the following performance measures have been defined: 

— Detection ratio: the % ratio of the total duration of correctly detected in- 
stances versus that of the actual ones, 

— False alarm ratio: the % ratio of the total duration of falsely detected in- 
stances versus that of the actual ones, 

— False rejection ratio: the % ratio of the total duration of missed detections
versus that of the actual ones.

We focus initially on the performance of the aural analysis tools. Silence
detection exhibits a remarkable performance of 100% detection ratio and 0%
false rejection ratio, locating entire words, phrases or sentences. Only rare
occasions of unvoiced speech frames being classified as silence frames have
been observed, leading to a false alarm ratio of 3.57%. There was no case of
silence being classified as any of the other audio types searched for. Music
detection exhibits a 96.9% detection ratio and a 3.1% false rejection ratio,
because some music segments of short duration are mistaken for speech. It has a
7.11% false alarm ratio, because it mistakes some speech segments for music
ones. On the other hand, speech detection is characterized by an 86.2% detection
ratio, a 13.8% false rejection ratio and a 2.4% false alarm ratio caused by
mistaking music segments for speech. Finally, speaker change instant detection
attains a suboptimal performance, mainly attributed to the fact that covariance
matrices and their inverses are poorly estimated given the limited number of
feature vectors extracted from segments of 2 sec duration. However, the use of
longer audio segments would imply that the same speaker is speaking for a longer
duration, which would be too long in many cases. Speaker change instants are
evaluated with a detection accuracy of 62.8%. We have 30.23% false detections,
while missed detections amount to 34.89%. Enhancement of this method may be
achieved by
simultaneously considering other similarity measures as well, as shown in EH- 
Despite the suboptimal performance of speaker change instant detection, however,
its use during audio-visual interaction for scene boundary detection leads to a
satisfactory outcome, in combination with the other segmentation results.
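
The three duration-based ratios used for the audio results can be computed as in
the following Python sketch. It assumes that actual and detected instances are
given as (start, end) intervals in seconds and that the intervals within each
list do not overlap one another; the function names and the interval
representation are illustrative, not taken from the paper.

```python
def overlap(a, b):
    """Length of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def total(intervals):
    """Summed duration of a list of (start, end) intervals."""
    return sum(end - start for start, end in intervals)


def duration_ratios(actual, detected):
    """Return (detection, false alarm, false rejection) ratios in percent,
    each relative to the total duration of the actual instances."""
    hit = sum(overlap(a, d) for a in actual for d in detected)
    actual_total = total(actual)
    detection = 100.0 * hit / actual_total
    false_alarm = 100.0 * (total(detected) - hit) / actual_total
    false_rejection = 100.0 * (actual_total - hit) / actual_total
    return detection, false_alarm, false_rejection
```

With this definition the detection and false rejection ratios always sum to
100%, as they do for the music and speech figures reported above, while the
false alarm ratio is independent of both.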

In order to evaluate the performance of the visual segmentation methods,
that is, the shot boundary detection methods presented in Sect. 3, the following
performance criteria are used [2]:

$$\text{Recall} = \frac{\text{relevant correctly retrieved shots}}{\text{all relevant shots}} = \frac{N_c}{N_c + N_m} \qquad (21)$$

$$\text{Precision} = \frac{\text{relevant correctly retrieved shots}}{\text{all retrieved shots}} = \frac{N_c}{N_c + N_f} \qquad (22)$$



where $N_c$ denotes the number of correctly detected shots, $N_m$ is the number
of missed ones and $N_f$ is the number of falsely detected ones. For comparison
purposes and to illustrate the strength of combining different methods and
fusing results, the above criteria are also measured for the single shot
detection methods presented in Sect. 3. Results for the single cases as well as
the combined one are presented in Table 1. Adaptive thresholding that leads to
the decision about shot boundaries is performed using two different lengths for
the local windows: W = 3 and W = 5. It can be observed that the combined method
attains the best results for W = 5. No false detections are made and the missed
ones are rather few even under dissolve camera effects. The color vector
histogram difference method is inferior in performance compared to the color
frame difference method because histograms do not account for spatial color
localization. However, the histogram approach is better under illumination
changes. To illustrate the discriminative power of all temporal difference
metrics considered in the shot cut detection methods, i.e., the color frame
difference metric $FD_t$, the color vector histogram difference metric $HD_t$
and the combined difference metric $OD_t$, Fig. 1 is given, where parts of these
temporal difference metrics are shown. The peaks in the third plot are much more
easily distinguishable, even in parts of the video sequence dominated by action
and movement, and the remaining values vary much less.

Table 1. Recall and Precision values achieved by the Shot Boundary Detection
methods.

Method                               W = 3                 W = 5
                                     Recall   Precision    Recall   Precision
Color Frame Difference               0.7047   0.5866       0.8456   0.7975
Color Vector Histogram Difference    0.3356   0.2155       0.5705   0.4271
Combined Method                      0.9329   0.9858       0.9396   1.0
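
The recall and precision criteria translate directly into code. In the Python
sketch below, the function names are hypothetical and the counts in the usage
note are illustrative values only, since the paper reports the resulting ratios
but not the underlying counts.

```python
def recall(n_correct, n_missed):
    """Fraction of actual shot boundaries that were detected."""
    return n_correct / (n_correct + n_missed)


def precision(n_correct, n_false):
    """Fraction of detected shot boundaries that are actual ones."""
    return n_correct / (n_correct + n_false)
```

For instance, hypothetical counts of 140 correct, 9 missed and 0 false
detections would give a recall of about 0.9396 and a precision of 1.0, the same
pattern as the combined method with W = 5: a precision of 1.0 means no false
detections, while the recall below 1 reflects the few missed boundaries.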



Fig. 1. Evaluated 1D temporal difference metrics: $FD_t$ (top plot), $HD_t$
(middle plot), $OD_t$ (bottom plot), for a certain temporal part of the input
sequence (t from about 480 to 560).



Finally, the performance of the method with respect to scene boundary
determination is investigated. The sequence under study contains 18 different
scenes, which are either dialogue scenes, action scenes, commercials or displays
of the serial logo. During boundary detection, those shots that exhibit the same
speaker speaking or the same music part are combined together into a scene. The
boundaries of the scenes are further extended according to shot boundaries. For
example, if the same speaker is found to be speaking between frames 100 and 200,
while shot boundaries have been detected at frames 85 and 223, then the scene
boundaries are extended to those frames, based on the enhanced performance of
our shot boundary detection. Cases have been observed in which scene boundaries
are extended even to a different speaker or music segment. Thus, dialogues may
be identified if the number of speaker change points in a scene is rather high.
Results show that 13 out of 18 scenes are correctly detected, 12 are false
detections (an actual scene is recognized as more than one due to the
non-overlapping of speaker boundaries, music boundaries and shot boundaries),
while 5 scene boundaries are missed. The performance is good considering that
simple rules are imposed for scene boundary detection. Further investigation of
scene characterization, incorporation of other analysis tools to define more
semantic primitives, and enhancement of the methods attaining a suboptimal
performance will be undertaken.
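
The boundary extension rule from the example above can be sketched in Python as
follows. The function name is hypothetical, and the handling of segments that
have no shot boundary on one side (the segment edge is kept) is an assumption,
since the text does not cover that case.

```python
import bisect


def extend_to_shot_boundaries(segment, shot_boundaries):
    """Widen an audio-derived (start, end) frame interval to the nearest
    enclosing shot boundaries; keep an edge if no boundary encloses it."""
    start, end = segment
    cuts = sorted(shot_boundaries)
    left = bisect.bisect_right(cuts, start)  # cuts[:left] are <= start
    right = bisect.bisect_left(cuts, end)    # cuts[right:] are >= end
    new_start = cuts[left - 1] if left > 0 else start
    new_end = cuts[right] if right < len(cuts) else end
    return new_start, new_end
```

With the figures from the text, a speaker segment covering frames 100 to 200 and
shot boundaries at frames 85 and 223 yields the scene interval (85, 223). A shot
boundary falling inside the speaker segment does not shorten the scene under
this rule.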

6 Conclusions 

Content analysis and indexing systems offer a flexible and efficient tool for further 
video retrieval and browsing, especially now that distributed digital multimedia 
libraries have become essential. When such tasks combine semantic information 
from different data sources (auditory, visual, textual) through multimodal in- 
teraction concepts, enhanced scene cut detection and identification is possible. 
In this paper, a scene boundary detection method has been presented that at- 
tains promising performance. Both aural and visual sources are analyzed and 
segmented. The audio types used are speech, silence and music. Video segmen- 
tation into shots is performed by a remarkably efficient method that combines 
metrics used by the two distinct approaches. Interaction of the single source seg- 
mentation results leads to the determination of scene boundaries and the partial 
scene characterization.

Abstract. A new image generation scheme is introduced. The scheme
linearly fuses multiple differently focused images into a new image in
which arbitrary linear processing, such as focusing (blurring),
enhancement, extraction, or shifting, is applied to objects in the scene.
The novelty of the work is that it does not require any segmentation to
produce visual effects on objects in the scene. It typically uses two
images of the scene: in one of them the foreground is in focus and the
background is out of focus; in the other, vice versa. A linear imaging
model is introduced, based on which an identity is derived between the
original images and the desired image, in which an object in the scene is
selectively manipulated; the desired image is then produced directly from
the original images. A linear filter is derived from this principle. The
two original images are filtered by the linear filters and added, which
yields the desired image. Various visual effects are examined, such as
focus manipulation, motion blur, enhancement, extraction, and shifting.
A special camera is also introduced with which three synchronized,
differently focused videos can be captured, so that dynamic scenes can
also be handled by the scheme. A real-time implementation using the
special camera for processing moving scenes is described as well.



1 Introduction 

Real images are manipulated to enhance the reality of the images in applica- 
tions such as post-production and computer graphics. For example, image based 
rendering (IBR), that is, generation of novel images from a given set of refer- 
ence images is intensively investigated in the field of computer graphics. View 
interpolation P, view morphing^, light field rendering |2| are among the IBR 
techniques. IBR usually generates new views of the object from a given set of 
images captured at different positions. The technique we propose in this pa- 
per is a novel IBR technique that manipulates visual effects such as focusing, 
enhancement etc. which are selectively applied to the objects in the scene. 

Producing a visual effect on an object in an image is usually handled by
segmenting the image into different objects, applying special effects to them,
and integrating them into a new image. Such manipulation requires
segmentation, which is hard to automate, so the user has to correct the
segmentation manually.

We propose a novel approach to the manipulation of visual effects which is
very different from the intuitive one described above. The proposed scheme
uses multiple (typically two) differently focused images captured from a fixed
position. By fusing the multiple images, it generates visual effects which are
selectively applied to an object in the scene. For instance, one of the two
acquired images has the foreground object in focus and the background object
out of focus, while the other is the reverse. Our scheme manipulates visual
effects on objects in the scene arbitrarily, using only linear filtering. It
can produce visual effects selectively on the foreground or the background.
The achievable visual effects are linear operations such as focusing,
blurring, enhancement, and shifting. The scheme needs only linear
space-invariant filtering, and it generates the target image directly from the
original images. Notably, it needs neither segmentation nor 3D modeling,
although it generates object-based visual effects. In our previous work, we
showed the principle based on iterative reconstruction. In this paper, we
show reconstruction using linear filters, which are uniquely determined.

So-called image fusion has been used to merge multiple images, such as
those from various types of image sensors. Image fusion can also fuse
differently focused images into an all-focused image, but it cannot handle
object-based special effects. The method proposed in this paper differs from
conventional image fusion in that it is able to achieve object-based special
effects.

Differently focused images have been used in computer vision for depth
from focus and depth from defocus, which compute the depth of the scene. Our
proposal also makes use of differently focused images, but it differs in that
it is aimed at image generation, not at depth computation.

The linear imaging model has also been used to represent transparently
overlapped images, such as a view through a glass window with reflections.
The separation of differently focused, overlapped transparent images was
formulated based on the linear imaging model, and a linear filtering approach
was investigated [7]. Its formulation is the same as in our approach. However,
we deal with ordinary scenes, not only transparent ones, and moreover we can
generate various object-based effects, including the extraction of objects.

2 Arbitrarily Focused Image Generation 

2.1 A Linear Imaging Model 

For the time being, we concentrate on the manipulation of focus among the 
operations which are achievable. For simplicity, we describe our proposed method 
for the case when we use two differently focused images. 

Suppose a scene consists of foreground and background objects. We define
$f_1(x)$ and $f_2(x)$ as the foreground and background textures, respectively.
The two differently focused images that are acquired are denoted $g_1(x)$ and
$g_2(x)$. In $g_1(x)$, the foreground is in focus and the background is out of
focus. In $g_2(x)$, the background is in focus and the foreground is out of
focus. Then, assuming the linear imaging model shown in Fig. 1, $g_1(x)$ and
$g_2(x)$ are formed as follows:

$$\begin{cases} g_1(x) = f_1(x) + h_2 * f_2(x) \\ g_2(x) = h_1 * f_1(x) + f_2(x). \end{cases} \qquad (1)$$

Fig. 1. Image formation by a linear imaging model

In $g_1$, $f_1$ is in focus and $f_2$ is out of focus; in $g_2$, vice versa.
$h_1$ and $h_2$ are the blurring functions applied to $f_1$ and $f_2$ in $g_2$
and $g_1$, respectively, and $*$ represents convolution. The amounts of blur
of $h_1$ and $h_2$ are determined by the optics of the camera and the depths
of the objects.

Focusing can be manipulated by changing the blurring functions applied to
$f_1$ and $f_2$. If the blurring functions $h_a$ and $h_b$ are applied to
$f_1$ and $f_2$, respectively, we are able to generate a new image $f$ whose
focusing differs from that of the originals $g_1$ and $g_2$. The image $f$ is
given by

$$f(x) = h_a * f_1(x) + h_b * f_2(x). \qquad (2)$$

In eq. (2), $h_a$ and $h_b$ are arbitrarily controllable and can differ from
$h_1$ and $h_2$. When $h_a = h_b = \delta(x)$, $f$ is an all-focused image.
($\delta(x)$ is the delta function such that $\delta(x) = 1$ if $x = 0$ and
$\delta(x) = 0$ otherwise.) When we set $h_a = \delta(x)$ and vary $h_b$, $f$
has a variably blurred background while the foreground is kept in focus.

To reconstruct $f$, the usual approach would be to segment the foreground
$f_1$ and the background $f_2$ in the original images, apply visual effects to
them, and fuse them. However, precise segmentation is difficult and presents a
serious barrier to automating the processing.

In this proposal, we take a different approach. From the equations in
eq. (1), we obtain the following equations:

$$\begin{cases} h_1 * g_1(x) - g_2(x) = (h_1 * h_2 - \delta) * f_2(x) \\ h_2 * g_2(x) - g_1(x) = (h_1 * h_2 - \delta) * f_1(x). \end{cases} \qquad (3)$$

Convolving these two equations with $h_b$ and $h_a$, respectively, then adding
the corresponding sides of the resulting two equations, and finally using
eq. (2) results in the equation given below, because convolutions with $h_a$,
$h_b$ and $(h_1 * h_2 - \delta)$ are interchangeable:

$$(h_b * h_1 - h_a) * g_1(x) + (h_a * h_2 - h_b) * g_2(x) = (h_1 * h_2 - \delta) * f(x). \qquad (4)$$

The above equation excludes $f_1$ and $f_2$ and is an identity between
$g_1$, $g_2$ and $f$. The blurring functions $h_a$ and $h_b$ are controlled by
the user. The blurring functions $h_1$ and $h_2$ arise when the images $g_1$
and $g_2$ are acquired; they can be estimated in a pre-processing step, either
by image-processing-based estimation or by camera-parameter-based
determination. The only unknown in eq. (4) is $f(x)$. Therefore, by solving
the linear equation (4), $f(x)$ is obtained directly.
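
The identity between $g_1$, $g_2$ and $f$ can be checked numerically. The following is a minimal sketch under stated assumptions: it uses 1-D signals and circular convolution instead of 2-D images, and the kernels, textures, and helper names (`cconv`, `add`, `sub`) are made up for illustration only.

```python
def cconv(a, b):
    """Circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[k] * b[(i - k) % n] for k in range(n)) for i in range(n)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

n = 8
delta = [1.0] + [0.0] * (n - 1)            # discrete delta function
h1 = [0.50, 0.25, 0, 0, 0, 0, 0, 0.25]     # blur of the foreground in g2
h2 = [0.60, 0.20, 0, 0, 0, 0, 0, 0.20]     # blur of the background in g1
ha = delta                                 # keep the foreground sharp
hb = [0.40, 0.30, 0, 0, 0, 0, 0, 0.30]     # user-chosen background blur

f1 = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]   # illustrative foreground texture
f2 = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0]   # illustrative background texture

g1 = add(f1, cconv(h2, f2))                # first acquired image
g2 = add(cconv(h1, f1), f2)                # second acquired image
f = add(cconv(ha, f1), cconv(hb, f2))      # desired image

# Left- and right-hand sides of the identity; f1 and f2 appear only via g1, g2, f.
lhs = add(cconv(sub(cconv(hb, h1), ha), g1),
          cconv(sub(cconv(ha, h2), hb), g2))
rhs = cconv(sub(cconv(h1, h2), delta), f)
max_err = max(abs(x - y) for x, y in zip(lhs, rhs))
print(max_err)   # difference should be at floating-point rounding level
```

The check relies only on the commutativity and associativity of convolution, which is the property the derivation uses.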

There are two ways to solve the linear equation (4). One is an iterative
approach and the other is an inverse filtering approach. In our previous
paper, the iterative approach was utilized. In this paper, we introduce a
linear inverse filtering solution which leads to more accurate and faster
processing.

2.2 Reconstruction by Linear Filtering 

We introduce inverse filters applied to the acquired images to obtain an
arbitrarily focused image $f(x)$. Firstly, we take the Fourier transform (FT)
of the imaging model to represent it in the Fourier domain. The FT of eq. (1)
is expressed by

$$\begin{cases} G_1 = F_1 + H_2 F_2 \\ G_2 = H_1 F_1 + F_2 \end{cases}$$

where $G_i$, $H_i$, $F_i$ ($i = 1, 2$) denote the Fourier transforms of $g_i$,
$h_i$, $f_i$, respectively. Similarly, the FT of eq. (2) is expressed by

$$F = H_a F_1 + H_b F_2. \qquad (5)$$




Fig. 2. Reconstruction of arbitrarily focused image from two differently focused images
using inverse filters

Secondly, we solve the Fourier-domain imaging model for $F_1$ and $F_2$, and
then find $F$ by substituting the solution into the Fourier-domain expression
for $F$. We assume that the blurring function $h_i$ is either Gaussian or
cylindrical. According to whether $(1 - H_1 H_2)$ is zero or not, we solve for
$F$ as follows:



(i) $1 - H_1 H_2 \neq 0$

At every frequency component except the DC component, $H_1 H_2 \neq 1$ is
satisfied. In this case, $F$ is given by the following filtering:

$$F = \frac{H_a - H_b H_1}{1 - H_1 H_2}\, G_1 + \frac{H_b - H_a H_2}{1 - H_1 H_2}\, G_2. \qquad (6)$$



(ii) $1 - H_1 H_2 = 0$

The DC component of $F$ cannot be obtained from eq. (6) because the
denominator is zero at DC. In this sense the DC component is ill-conditioned,
as in general image restoration problems. However, in our specific problem of
reconstructing arbitrarily focused images from two differently focused images,
we can find the DC component of $F$ by taking the limit of the filters applied
to $G_1$ and $G_2$ in eq. (6) at DC. Applying l'Hôpital's rule, we obtain the
following limit values at DC for both Gaussian and cylindrical PSFs:

$$\lim_{\xi, \eta \to 0} \frac{H_a - H_b H_1}{1 - H_1 H_2} = \frac{R_1^2 + R_b^2 - R_a^2}{R_1^2 + R_2^2} \qquad (7)$$

$$\lim_{\xi, \eta \to 0} \frac{H_b - H_a H_2}{1 - H_1 H_2} = \frac{R_2^2 + R_a^2 - R_b^2}{R_1^2 + R_2^2} \qquad (8)$$

where $\xi$ and $\eta$ are the horizontal and vertical frequencies and $R_i$
($i = 1, 2, a, b$) are the blur radii.
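
The DC limit values can be traced with a short expansion. This is a sketch under an assumption the text does not spell out: each Gaussian transfer function is taken to have the form $H_i = \exp(-c R_i^2 \rho^2)$ with $\rho^2 = \xi^2 + \eta^2$ and one common constant $c > 0$ for all four blurs.

```latex
% Small-frequency expansion of each (assumed) Gaussian transfer function
H_i = e^{-c R_i^2 \rho^2} = 1 - c R_i^2 \rho^2 + O(\rho^4),
\qquad i = 1, 2, a, b.

% Numerator and denominator of the filter applied to G_1
H_a - H_b H_1 = c\,(R_1^2 + R_b^2 - R_a^2)\,\rho^2 + O(\rho^4),
\qquad
1 - H_1 H_2 = c\,(R_1^2 + R_2^2)\,\rho^2 + O(\rho^4).

% Hence the limits at DC
\lim_{\rho \to 0} \frac{H_a - H_b H_1}{1 - H_1 H_2}
  = \frac{R_1^2 + R_b^2 - R_a^2}{R_1^2 + R_2^2},
\qquad
\lim_{\rho \to 0} \frac{H_b - H_a H_2}{1 - H_1 H_2}
  = \frac{R_2^2 + R_a^2 - R_b^2}{R_1^2 + R_2^2}.
```

A cylindrical PSF has a transfer function of the form $2 J_1(R_i \rho)/(R_i \rho) = 1 - R_i^2 \rho^2 / 8 + O(\rho^4)$, which is the same expansion with $c = 1/8$, so the same limits follow. The two limit values sum to 1, so the mean brightness of the scene is preserved.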



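From the results above, the desired image $F$ can be obtained as

$$F = K_a G_1 + K_b G_2 \qquad (9)$$

where $K_a$ and $K_b$ are the linear filters represented in the frequency
domain, which are expressed by

$$K_a = \begin{cases} \dfrac{R_1^2 + R_b^2 - R_a^2}{R_1^2 + R_2^2} & \text{if } \xi = \eta = 0 \\[2ex] \dfrac{H_a - H_b H_1}{1 - H_1 H_2} & \text{otherwise} \end{cases} \qquad (10)$$

$$K_b = \begin{cases} \dfrac{R_2^2 + R_a^2 - R_b^2}{R_1^2 + R_2^2} & \text{if } \xi = \eta = 0 \\[2ex] \dfrac{H_b - H_a H_2}{1 - H_1 H_2} & \text{otherwise.} \end{cases} \qquad (11)$$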



$R_1$ is the radius of the blur circle of the foreground when the background
is in focus. $R_2$ is that of the background when the foreground is in focus.
$R_1$ and $R_2$ can be estimated by our previously proposed pre-processing
method. $R_a$ and $R_b$ are parameters chosen by the user. $H_i$
($i = 1, 2, a, b$) are determined by the blur radii and the type of the PSFs.
Thus, the FTs of the linear filters are uniquely determined, and the desired
image $F$ (in the frequency domain) can be directly reconstructed by linear
filtering (eq. (9)). The diagram of the reconstruction is shown in Fig. 2.
Note that no segmentation is required in this method. Finally, by applying the
inverse FT to $F$, we obtain the desired image $f$.
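
The reconstruction by linear filtering can be sketched in one dimension as follows. This is an illustration under stated assumptions, not the authors' implementation: the signals are synthetic, the Gaussian transfer function uses the blur radius directly as its standard deviation, and the helper names (`dft`, `idft`, `gaussian_tf`, `fusion_filters`) are hypothetical.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def gaussian_tf(radius, n):
    """Gaussian transfer function; radius is used as the standard deviation (assumption)."""
    out = []
    for k in range(n):
        kk = k if k <= n // 2 else k - n      # signed frequency index
        w = 2 * math.pi * kk / n
        out.append(math.exp(-0.5 * (radius * w) ** 2))
    return out

def fusion_filters(r1, r2, ra, rb, n):
    """Filters Ka, Kb: limit values at DC, ratio of transfer functions elsewhere."""
    h1, h2, ha, hb = (gaussian_tf(r, n) for r in (r1, r2, ra, rb))
    ka, kb = [], []
    for k in range(n):
        if k == 0:
            ka.append((r1**2 + rb**2 - ra**2) / (r1**2 + r2**2))
            kb.append((r2**2 + ra**2 - rb**2) / (r1**2 + r2**2))
        else:
            den = 1.0 - h1[k] * h2[k]
            ka.append((ha[k] - hb[k] * h1[k]) / den)
            kb.append((hb[k] - ha[k] * h2[k]) / den)
    return ka, kb

n = 16
f1 = [float((3 * t) % 7) for t in range(n)]        # synthetic foreground texture
f2 = [float((5 * t + 2) % 11) for t in range(n)]   # synthetic background texture
r1, r2, ra, rb = 3.0, 2.0, 0.0, 4.0

F1, F2 = dft(f1), dft(f2)
H1, H2, Ha, Hb = (gaussian_tf(r, n) for r in (r1, r2, ra, rb))
G1 = [F1[k] + H2[k] * F2[k] for k in range(n)]     # simulated acquired images
G2 = [H1[k] * F1[k] + F2[k] for k in range(n)]

Ka, Kb = fusion_filters(r1, r2, ra, rb, n)
F = [Ka[k] * G1[k] + Kb[k] * G2[k] for k in range(n)]       # filtered and added
target = [Ha[k] * F1[k] + Hb[k] * F2[k] for k in range(n)]  # desired spectrum
max_err = max(abs(a - b) for a, b in zip(F, target))
f = idft(F)                                        # desired image in the signal domain
print(max_err)
```

In this sketch the filtered sum matches the desired spectrum at every frequency, including DC, without ever separating `f1` from `f2`.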





Fig. 3. Frequency characteristics of the inverse filters ($R_1 = 3$, $R_2 = 2$, $R_a = 0$)



Frequency characteristics of the linear filters $K_a$ and $K_b$ are shown in
Fig. 3 for $R_1$ and $R_2$ of 3 and 2 pixels, respectively, with $R_b$ varied
from 0 to 4 pixels while $R_a$ is held at 0 pixels. The generated images are
such that the background is variably blurred while the foreground is kept in
focus. Fig. 3 shows the case in which the blurring PSFs ($h_i$) are Gaussian.
Note that noise amplification, which is generally a critical issue, is not as
large as it can be with a general inverse filtering method, because the
characteristics at higher frequencies converge to 1.



3 Experiments for Real Images 

For the experiments using real images, we need to pay attention to the
following two points, which must be handled in pre-processing.

(1) The blurring parameters $R_1$ of $h_1$ and $R_2$ of $h_2$ should be
estimated because they are not known.

(2) The image sizes of $g_1$ and $g_2$ should be made equal because the
different focus settings make the fields of view slightly different.

There are two ways to obtain these parameters: one is fully
image-processing-based estimation and the other is camera-parameter-based
determination. In the image-processing-based estimation, discrepancies in size
and blurring between the two images are detected by image processing:
hierarchical matching is applied to estimate the size difference, and the
radii $R$ of $h_1$ and $h_2$ are estimated in a coarse-to-fine manner. The
hierarchical registration, which takes into account discrepancies in size,
position and rotation, is described in detail in our earlier paper.

(a) The near in focus ($g_1$) (b) The far in focus ($g_2$)

Fig. 4. Real images used for experiments

In the camera parameter based determination, camera characteristics such 
as discrepancies of image size vs focal length and blurring vs focal length are 
measured using test charts placed at various depths, and put into a look-up- 
table in advance. Because the discrepancies of the size and the blurring between 
the two images are uniquely determined by the camera focal length, they are 
obtained just by the focal length parameters when the images are taken. 

We use the real images of Fig. 4(a) and (b) for the experiments. Using the
pre-processing fully based on image processing, image $g_2$ is slightly
enlarged. The blurring $R_2$ of the far object in $g_1$ and the blurring $R_1$
of the near object in $g_2$ are estimated as $R_1 = 4.73$ and $R_2 = 4.77$.

After the pre-processing, we can generate arbitrarily focused images from the
acquired images by filtering. Some are shown in Fig. 5. Fig. 5(a) shows the
all-focused image in which both the far and the near objects are in focus
($R_a = R_b = 0$ pixels). In Fig. 5(b) the near object is slightly blurred
($R_a = 2$, $R_b = 0$ pixels); compared to the original Fig. 4(b), the near
object is slightly restored while the far objects are kept in focus. In
Fig. 5(c), the far objects are strongly blurred ($R_a = 0$, $R_b = 10$
pixels); compared to Fig. 4(a), the background is much more blurred while the
near object is kept in focus.

As shown in the figures, the proposed method for manipulating focus works
well for real images. Again, the method does not need any segmentation. Our
proposal is based on the linear imaging model, which assumes that the depth of
the scene changes stepwise. Real scenes never satisfy this assumption exactly.
In fact, the scene used in the experiments is obviously not composed of two
planar layers. Although the model is a rough approximation, the experiments
verify that the method obtains satisfactory results.

Fig. 5. Various arbitrarily focused synthesis images by linear filtering:
(a) all-focused image; (b) near object is slightly blurred; (c) background is
much blurred.



4 Various Manipulations 

Because the operations $h_a$ and $h_b$ can be any linear processing, we can
achieve various manipulations of visual effects on objects in an image in the
same way, for example enhancement, shifting, and extraction. Among them, a
result of enhancement is shown in Fig. 6. If $h_a$ is an enhancement filter,
the object $f_1$ is selectively enhanced. As shown in Fig. 6, only the near
object is enhanced while the background is kept the same as that of the
all-focused image. The difference is clearly visible in the textures of the
object. Examples of shifting and extraction are shown in our earlier paper.

5 Real-Time Implementation 

To apply the proposed technique to scenes with moving objects in a real-time
system, multiple differently focused images must be captured at the same
time. We have developed a special camera, called the multi-focus video
camera, which can acquire three differently focused videos of the same scene.
With this multi-focus video camera, we have built a real-time system for
arbitrarily focused moving images.

Fig. 6. Enhancement of an object



5.1 Multi-focus Video Camera 

Fig. 7 illustrates the structure of the multi-focus video camera. The light
beam from the objects passes through the lens and is then divided into three
directions by the beam splitter. Focal lengths can be adjusted by moving each
CCD camera along the beam axis.




Fig. 7. Multi-focus video camera: (a) structure of the multi-focus video
camera; (b) the multi-focus camera.

Fig. 8. Diagram of the real-time arbitrarily focused moving images system



5.2 Real-Time Arbitrarily Focused Moving Images System 

As illustrated in Fig. 8, the system consists of the multi-focus video
camera, a 4-segmented display, and an SGI Onyx workstation (R10000, 195 MHz).
Using the multi-focus video camera, we capture two differently focused images;
one is focused on the near object and the other on the far objects. The two
images are integrated into one image using the 4-segmented display, so that
both images can be input to the workstation at exactly the same time. The
workstation then processes the input images, generates an arbitrarily focused
image and displays the result. In this system, we can also adjust the focus of
the synthesized images in real time.

5.3 Experimental Results 



Fig. 9 shows the experimental results of the real-time system. Fig. 9(a) and
(b) are the input video images, i.e. the near-focused and far-focused images,
and Fig. 9(c) is the synthesized video image. Here, at the beginning of the
experiment the foreground and background objects are both in focus; then the
background objects are blurred gradually while the foreground object is kept
in focus. For an image size of 128 x 128 pixels, the real-time system can
generate arbitrarily focused moving images at around 3 frames/second.

In reconstruction using the iterative approach, the processing time depends
on the blur radii: it increases as the blur radii become larger. This is
unfavorable for a real-time system, because the frame rate changes whenever
the user changes the parameters of the synthesized image. With the linear
filtering approach, processing is faster and this problem is resolved.

Fig. 9. Samples of arbitrarily focused moving images: (a) near-focused image;
(b) far-focused image; (c) arbitrarily focused image.

6 Conclusion 

In this paper, we have shown a novel approach to producing object-based
visual effects. Space-invariant linear filtering applied to two differently
focused images generates various visual effects that are selectively produced
on objects in the scene. As long as the visual effect is a linear operation,
the method can produce such object-based effects. Again, the method does not
need any segmentation.

To apply the proposed technique to scenes with moving objects, multiple
differently focused images must be captured at the same time. We developed a
special camera which can acquire three differently focused videos of the same
scene, as well as a real-time implementation using the camera.

Abstract. Representing general images using global features extracted
from the entire image may be inappropriate because the images often 
contain several objects or regions that are totally different from each 
other in terms of visual image properties. These features cannot ade- 
quately represent the variations and hence fail to describe the image 
content correctly. We advocate the use of features extracted from im- 
age regions and represent the images by a set of regional features. In 
our work, an image is segmented into “homogeneous” regions using a 
histogram clustering algorithm. Each image is then represented by a set 
of regions with region descriptors. Region descriptors consist of feature 
vectors representing color, texture, area and location of regions. Image 
similarity is measured by a newly proposed Region Match Distance met- 
ric for comparing images by region similarity. Comparison of image re- 
trieval using global and regional features is presented and the advantage 
of using regional representation is demonstrated. 



1 Introduction 

Image databases have been generated for applications such as criminal identifi- 
cation, multimedia encyclopedia, geographic information systems, online appli- 
cations for art articles, medical image archives and trademark. The volume of 
these databases is expanding drastically. Effective image indexing and retrieval 
techniques then become ever more important and critical to facilitate people 
searching for information from large image databases. 

It is generally agreed that image retrieval based on image content is more 
rational and desirable. There has been intensive research activity in Content- 
Based Image Retrieval (CBIR) systems P, |2|, |2|> |S|, [Il|, jlH], |2^ . 

Many CBIR methods use features extracted from the entire image. However, 
for general images depicting a variety of scene domains, such global features will 
show their limits in representing the image content correctly because the images 
often contain several objects or regions that are totally different from each other 
in terms of visual image properties. Hence, we advocate the use of features 
extracted from image regions and an image is represented by a set of regions. 
The regions may be obtained by segmenting the image using color, texture or any 

R. Klette et al. (Eds.): Multi-Image Analysis, LNCS 2032, pp. 238-^^^ 2001.
other image properties. Various image properties in a region are then extracted to
represent that region. These properties are represented as feature vectors, which 
are used as the region descriptor. Finally, the entire image is represented by a 
set of such regions with region descriptors consisting of extracted image features. 
In the present work, an image is segmented into “homogeneous” color regions 
using a histogram clustering algorithm. Each image is then represented by a 
feature set consisting of region descriptors. The region descriptors are made up 
of feature vectors representing color, texture, area and location of regions. Image 
similarity is measured by the Region Match Distance (RMD) which is defined 
based on the Earth Mover’s Distance {EMD) adopted for regions. A comparison 
of image retrieval using global features and regional features is presented which 
demonstrates the advantage of using regional representation. 

The rest of the paper is organized as follows. Section 2 discusses the image
representation issue. Section 3 describes a content-based image retrieval
system using regional representation. In Section 4, experimental results are
presented to compare retrieval using global features with retrieval using
regional features. Finally, Section 5 concludes the present work and proposes
future work.



2 Image Representation: Global vs. Regional 

Content-based image retrieval methods that use features extracted from the
entire image can be considered a global approach. Global features have some
weaknesses when dealing with general images, which often contain several
objects or regions that are totally different from each other, each with its
own set of attributes (see Figure 1). Global features cannot reflect the
variation of image properties among regions and thus fail to describe the
image content correctly. Nevertheless, global feature extraction is much more
straightforward and fast. In many situations, global features are sufficient
to achieve acceptable image retrieval performance.




Fig. 1. Images containing different objects of interest (church, flower,
street, airplane)



In our work, we propose to use a regional representation of images, with
retrieval based on region similarity. The overall image similarity is
determined from region similarity by adopting the Earth Mover's Distance. This
can be seen as a step towards higher-level descriptions of an image. In the
ideal case where a region corresponds to an object, the representation of the
image is then in terms of objects. The image can then be represented by its
perceptual organization. This is a step closer to image representation by
image semantics.

Recently, some research work in this area has been reported. Methods such 
as RRD 131, CPCjHl, MCAG|T3, and CRT|22I are proposed to represent color 
images. Most of them are based on global color histogram with added spatial 
domain information. Strictly speaking, these are not really region-based fea- 
tures. In our approach, the features are extracted from image regions. Also, this 
region-based image representation is believed more similar to the process in hu- 
man perception. For example, a human often observes an image with a focus of 
attention. The focus is usually on the object(s) of interests, such as building, 
flower, and airplane shown in Figure E As a result, a human groups images 
according to the object(s) of interests. He/She will also hope to retrieve images 
based on these objects. A region-based description allows us to search images 
based on objects or regions, thus enabling the focus of attention. Global image 
description can still be achieved by a collection of local regional descriptions to 
allow search by the entire image content. Motivated by the above considerations, 
a region-based image retrieval approach using features of regions is developed. 



3 Content-Based Image Retrieval System Based on 
Regional Representation 

Developing a region-based image retrieval (RBIR) system involves the follow- 
ing main tasks: (i) image segmentation by some chosen criteria; (ii) feature ex- 
traction from regions; (iii) construction of region feature sets; (iv) determining 
similarity between query feature set and the target image feature set. Tasks (i) 
- (iii) are needed to construct the indexing system for RBIR. This is an off-line 
process. In the on-line process, a query image is presented and all the tasks are 
performed. The I most similar images are retrieved in descending order of
similarity, where I can be specified by the user. An overview of the RBIR
system is illustrated in Figure 2.
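
The on-line ranking step can be sketched as below. This is a minimal illustration under stated assumptions: `retrieve`, the dictionary-based index, and the toy scalar distance are hypothetical stand-ins, and the distance used here is not the Region Match Distance of the paper.

```python
import heapq

def retrieve(query_features, index, distance, l=10):
    """Return the names of the l indexed images closest to the query,
    most similar (smallest distance) first.

    `index` maps an image name to its precomputed feature set (off-line
    process); `distance` is any dissimilarity function on feature sets.
    """
    scored = ((distance(query_features, feats), name)
              for name, feats in index.items())
    return [name for _, name in heapq.nsmallest(l, scored)]

# Toy usage with scalar "features" and absolute difference as the distance.
index = {"a": 1.0, "b": 4.0, "c": 2.5}
result = retrieve(2.0, index, lambda q, t: abs(q - t), l=2)
print(result)  # ['c', 'a']
```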

3.1 Region Extraction 

Image segmentation partitions an image into “meaningful” regions which are
assumed to be homogeneous in some sense, such as brightness, color, or
texture. In this work, a color-spatial histogram clustering segmentation
algorithm is used. It is well known that most existing image segmentation
algorithms cannot produce consistent segmentations for images of the same
scene captured under different illumination conditions, and may not produce
consistent and accurate region boundaries. This has discouraged many
researchers from considering image segmentation as a pre-processing step in
their CBIR systems. We believe these problems can be alleviated by a proper
choice of segmentation criteria and by placing more emphasis on large regions.

Fig. 2. The region-based image retrieval system



Due to the difference between human perception of colors and the computer's
representation of colors by the three primaries RGB, equal changes in the RGB
space do not necessarily result in equally noticeable changes in human
perception. Hence, we transform the color pixels from RGB space to the CIE-Lab
color space, in which the perceived differences between nearby colors
correspond to the Euclidean distances between the color space coordinates. In
this color space, L correlates with brightness, a with redness-greenness and
b with yellowness-blueness.
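
A conversion from RGB to CIE-Lab can be sketched as follows. The text does not state which RGB primaries or white point are used, so this sketch assumes 8-bit sRGB input and the D65 reference white; the function name `rgb_to_lab` is illustrative.

```python
def rgb_to_lab(r, g, b):
    """Convert 8-bit sRGB values to CIE-Lab (D65 white point assumed)."""
    def linearize(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)
    # Linear sRGB -> CIE XYZ
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    def f(t):
        d = 6.0 / 29.0
        return t ** (1.0 / 3.0) if t > d ** 3 else t / (3 * d * d) + 4.0 / 29.0

    xn, yn, zn = 0.95047, 1.0, 1.08883       # D65 reference white
    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    return 116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)

white = rgb_to_lab(255, 255, 255)   # L near 100, a and b near 0
black = rgb_to_lab(0, 0, 0)         # L, a, b all 0
red = rgb_to_lab(255, 0, 0)         # positive a (reddish), positive b (yellowish)
```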

The segmentation of the color image is then performed in this perceptually
uniform CIE-Lab color space. The coordinates of each pixel are appended to
the Lab color values to include the spatial information. As a result, a
color-spatial feature space is constructed. This space is quantized to produce the
histogram space on which the clustering algorithm operates. Too coarse quan- 
tization will lead to incomplete segmentation while too fine quantization will 
result in over-segmentation, both of which can adversely affect the retrieval per- 
formance. Through experimentation, quantization level for the 5 dimensional 
histogram space is L = 4, a = 6 = 25, and X = Y = b. The image regions are 
then formed by the clusters obtained from the histogram space clustering algo- 
rithm. Hole-filling operation is performed to remove small regions within large 
regions. 

After image segmentation, the $R$ largest regions are selected such that
$\sum_{i=1}^{R} s_i = w s$, where $s_i$ represents the number of pixels in the
$i$th region, $s$ represents the total number of pixels in the image, and $w$
is the percentage threshold. In this work, $w = 0.95$, i.e. the $R$ largest
regions, which together cover up to 95% of the image area, are taken. Thus $R$
sub-images can be formed, each of which contains only one “homogeneous” region
(see examples in Figure 3).
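
The region selection can be sketched as follows. The exact stopping rule is an assumption: here regions are added in order of decreasing size until their cumulative area first reaches the fraction `w` of the image area. The function name `select_largest_regions` is illustrative.

```python
def select_largest_regions(region_sizes, w=0.95):
    """Return indices of the largest regions, taken in order of decreasing
    size until their cumulative pixel count reaches w times the total."""
    total = sum(region_sizes)
    order = sorted(range(len(region_sizes)),
                   key=lambda i: region_sizes[i], reverse=True)
    chosen, covered = [], 0
    for i in order:
        if covered >= w * total:
            break
        chosen.append(i)
        covered += region_sizes[i]
    return chosen

# Five regions with pixel counts summing to 1000; the three largest cover 95%.
selected = select_largest_regions([500, 300, 150, 30, 20])
print(selected)  # [0, 1, 2]
```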




Fig. 3. Examples of region extraction by color image segmentation 



3.2 Regional Representation 

Many visual properties can be used to characterize an image region. The
commonly used visual properties are color, texture and shape. In the present
work, color and texture are used to represent regions in the images; shape
will be considered in the future. The next question is how to represent the
color and texture properties of a region. Typical approaches use a single
color, color pairs, the color mean, or a color histogram to index the color
information contained in the images. Each approach has different advantages
and disadvantages. Over the past decades, numerous approaches for the
representation of textured images have been proposed. Here, we follow earlier
work in which a 3-level wavelet decomposition was used to derive texture
features from the wavelet-transformed coefficients. In retrieval by regions,
the locations of the regions and their sizes can be important, because “more
important” regions are usually found near the center of an image. Also, the
size of a region can be a criterion when defining similarities. Hence, the
following image properties are extracted from regions:

Color feature:

— Color mean: The color mean is a basic image feature. It is considered to be
less effective in representing whole images, but it is meaningful when used
to represent a homogeneous color region. Let $c_j$ denote the color value of
a pixel and $m_{ci}$ ($i = 0, \ldots, n$) denote the mean value of color $c$
in the $i$th region $\Omega_i$; then

$$m_{ci} = \frac{\sum_{j \in \Omega_i} c_j}{p}, \quad (j = 1, \ldots, p) \qquad (1)$$

where $p$ represents the number of pixels in the $i$th region. The color value
here is a vector $(L, a, b)^T$.

— Color histogram: Color histogram represents of the color distributions of 

a region. This regional color feature can be obtained by quantizing the L, a, 
b, the color space into bins. The quantization level was set to qi = 2, = 9, 

Qb — 9. Thus the following color feature vector consisting 162 elements is 
defined, 

fc = (^ij ■■•j ^162)^ ( 2 ) 

where hi (i = 1,...,162) represent the normalized histogram value for ith 
bin of the Lab color histogram respectively. 
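The two color features can be illustrated with a small sketch. The value ranges used for binning (L in [0, 100], a and b in [-128, 128]) and the ordering of the 162 bins are assumptions made for this example; the text only fixes the number of quantization levels (2, 9, 9). The function names are hypothetical.

```python
def color_mean(pixels):
    """Mean (L, a, b) vector of a region given as a list of (L, a, b) tuples."""
    p = len(pixels)
    return tuple(sum(c[k] for c in pixels) / p for k in range(3))


def lab_histogram(pixels, bins=(2, 9, 9),
                  ranges=((0.0, 100.0), (-128.0, 128.0), (-128.0, 128.0))):
    """Normalized Lab histogram with bins[0] * bins[1] * bins[2] entries (162 by default)."""
    hist = [0.0] * (bins[0] * bins[1] * bins[2])
    for c in pixels:
        idx = 0
        for k in range(3):
            lo, hi = ranges[k]
            q = int((c[k] - lo) / (hi - lo) * bins[k])  # quantization level along channel k
            q = min(max(q, 0), bins[k] - 1)             # clamp values on or beyond the range edge
            idx = idx * bins[k] + q                     # flatten (L, a, b) levels into one index
        hist[idx] += 1.0
    return [h / len(pixels) for h in hist]


region = [(50.0, 0.0, 0.0), (70.0, 10.0, -10.0)]  # hypothetical region with two pixels
print(color_mean(region))
print(len(lab_histogram(region)))
```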



Wavelet texture feature: A three-level wavelet decomposition using the “Daub4”
wavelet is performed on the L, a, and b images. The means and standard deviations
of the approximation coefficients and of the horizontal, vertical and diagonal detail
coefficients at each level are used to construct the wavelet feature vector.
This feature vector has 3 × 2 × 4 × 3 = 72 dimensions:

$$ f_w = (\alpha_1, \alpha_2, \ldots, \alpha_{72})^T \qquad (3) $$

Region Location: The location of the region is represented by a bounding box
which is subdivided into 3 × 3 sub-boxes. These sub-boxes yield a region
location vector (RLV) of 9 elements, which represent the top-left, top-middle,
top-right, middle-left, middle-middle, middle-right, bottom-left, bottom-middle,
and bottom-right locations respectively. The RLV is defined as follows.

$$ l_w = (l_1, l_2, l_3, l_4, l_5, l_6, l_7, l_8, l_9)^T \qquad (4) $$

The value of each element is defined as the ratio of the area of the
region falling in the location represented by this element to the whole area of
the region:

$$ l_i = \frac{\text{area}(\Omega \cap \text{location}_i)}{\text{area of region}} , \quad (i = 1, 2, \ldots, 9) \qquad (5) $$

AreaVector: An AreaVector for each image is defined to reflect the area
distribution among its regions. Each element in the vector is the ratio of the area
of a region to that of the entire image. For example, if an image is segmented
into three regions, its AreaVector will contain three elements, each of which is
the ratio between the area of one region and the area of the whole image.

$$ AreaVector = [r_1, r_2, \ldots, r_R] , \quad r_i = \frac{\text{area of region}_i}{\text{area of image}} , \quad (i = 1, 2, \ldots, R) \qquad (6) $$

It is clear that $\sum_i r_i = 1$ $(i = 1, 2, \ldots, R)$. Each element of the AreaVector
reflects the importance of the region to the whole image in terms of area.
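A minimal sketch of the two location and size descriptors follows. It assumes that the 3 × 3 grid spans the whole image frame (so that the RLVs of regions from different images are comparable) and that a region is given as a list of (row, column) pixel coordinates; both choices, and the function names, are assumptions of this example rather than details stated in the text.

```python
def region_location_vector(pixels, height, width):
    """Fraction of a region's pixels in each cell of a 3 x 3 grid laid over the image."""
    counts = [0] * 9
    for row, col in pixels:
        i = min(row * 3 // height, 2)  # grid row: 0 = top, 1 = middle, 2 = bottom
        j = min(col * 3 // width, 2)   # grid column: 0 = left, 1 = middle, 2 = right
        counts[3 * i + j] += 1
    return [c / len(pixels) for c in counts]


def area_vector(region_sizes, image_pixels):
    """Ratio of each region's area to the area of the whole image."""
    return [s / image_pixels for s in region_sizes]


# Hypothetical region covering rows 0-1 and columns 0-3 of a 6 x 6 image.
region = [(r, c) for r in range(2) for c in range(4)]
print(region_location_vector(region, 6, 6))
print(area_vector([18, 9, 9], 36))
```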



FeatureSet Construction: The regions have now been represented by a feature
vector for color or texture and an RLV for region location. A FeatureSet
representing the whole image can be constructed based on these region descriptors and
the AreaVector. For an image with R regions, the FeatureSet will be:

$$ FeatureSet = \{ f_1, l_{w1}, r_1 ; \; f_2, l_{w2}, r_2 ; \; \ldots ; \; f_R, l_{wR}, r_R \} \qquad (7) $$

where $f_i$ $(i = 1, 2, \ldots, R)$ is the feature vector for region i, which may be the color mean,
color histogram, or wavelet texture feature; $l_{wi}$ $(i = 1, 2, \ldots, R)$ is the RLV for
region i; and $r_i$ $(i = 1, 2, \ldots, R)$ is the AreaVector element for region i.

3.3 Regional Similarity Measurement 

In this work, a distance measure for matching two feature sets is proposed: the
Region Match Distance (RMD), which is based on the Earth Mover’s Distance (EMD).
The EMD is a previously proposed distance measure that aims to measure the
similarity between two variable-size descriptions of two distributions. Basically,
it reflects the minimal cost that must be paid to transform one distribution into
the other. The EMD has many desirable properties. It is more robust than
other histogram matching techniques for comparing two distributions,
in that it does not suffer from the arbitrary quantization problems caused by the
fixed binning of the latter. Also, it allows for partial matches between any two
distributions. This makes the EMD very appealing for matching two images that
may not have equal numbers of regions. A maximal match
is sought.



From EMD to RMD: An image can be viewed as a distribution over the
two-dimensional spatial domain. After the segmentation, only the prominent
regions are extracted from the original distribution and are used to form the
FeatureSet. In the FeatureSet, each region is represented by a single point in the
relative feature-location coordinate system, together with a weight (an element
of the AreaVector of the image) that denotes the size of that region. Using
the EMD to compute the distance between two feature sets amounts to emulating one
FeatureSet with the other in terms of their AreaVector match. The RMD
is proposed based on this idea. It reflects the minimal work needed in matching
the two image distributions.



Definition of RMD: The RMD is computed based on the solution to the
traditional transportation problem, as in the EMD. First, we need to define
an elementary distance between two regions, called the Region Distance (RD):

$$ rd = \| v_q - v_d \| / w_{ol} \qquad (8) $$

where $v_q$ is a feature vector of a region from the query and $v_d$ is a feature
vector of a region from an image in the database.

Here $w_{ol}$ is a weight which reflects the location overlap of two regions. It is
defined as follows: let $l_{wq}$ and $l_{wd}$ be the RLVs for the query image and the image
from the database respectively,

Wol — 2 — ( 9 ) 

where $l_{qi} \in l_{wq}$, $l_{di} \in l_{wd}$ $(i = 1, 2, \ldots, 9)$.

Using $w_{ol}$ as the weight for RD means that the more two regions overlap, the
smaller the distance. Then, letting $\Phi$ and $\Psi$ represent the sets of regions from two images
P and Q, we define the RMD between the two distributions as:

RMD[P,Q) = (10) 

where $c_{ij}$ $(i \in \Phi, j \in \Psi)$ is the amount of “flow” from region i in the first image
to region j in the second image, C is the set of all permissible flows $c_{ij}$, and $d_{ij}$ is
the RD between region i of one image and region j of the other image. “Flow” here
is a measure of how much of one region can be transformed/matched
into another region; it can be measured in terms of the absolute size of the region
(number of pixels) or the area of the region relative to the whole image.

The solution for the RMD is obtained by solving the transportation problem to
obtain the combination of $c_{ij}$ such that the RMD equation above is minimized.
In other words, it matches the regions in the two images by pairing them such that
the overall dissimilarity measure between the two images is as small as possible.
The values of $c_{ij}$ are subject to the following constraints:

$$ c_{ij} \geq 0 ; \quad i \in \Phi, \; j \in \Psi \qquad (11) $$

$$ \sum_{j \in \Psi} y_j \leq \sum_{i \in \Phi} x_i \qquad (12) $$

$$ \sum_{i \in \Phi} c_{ij} = y_j ; \quad j \in \Psi \qquad (13) $$

$$ \sum_{j \in \Psi} c_{ij} \leq x_i ; \quad i \in \Phi \qquad (14) $$

where $y_j \in Y$ $(j \in \Psi)$, $x_i \in X$ $(i \in \Phi)$, and X, Y are the AreaVectors for the two
images.

The first constraint above ensures that the flow is unidirectional, from
i to j, to avoid a negative minimal value caused by a flow in the opposite
direction. The next constraint specifies that the bigger image, i.e. the image
with more regions, is always considered as the “supplier” of flow. This allows for
partial matches between images with unequal numbers of regions, and maintains the
symmetrical relation RMD(P, Q) = RMD(Q, P). The last two conditions
require that all parts of the “consumer” (smaller image) must be matched to a
flow, while the flow from the supplier cannot exceed the number of parts/pixels
it can provide. These also mean that the smaller image has to be fully emulated
by the larger image, but not vice versa.

Thus the RMD extends the distance between two images to a distance between
two sets of regions, even when the two sets are of different sizes. It can reflect
the notion of nearness without the quantization problems that other histogram
matching techniques have. It also allows for partial matches between two images
naturally. The characteristics of the RMD are similar to those of the EMD.



An example of image pair matching using the RMD: An example of the
matching process for two images containing three and two regions respectively
is illustrated in Figure 4. In the figure, $f_{ij}$ is the feature vector for the jth region of
image i (i = 1, 2; j = 1, 2, 3). $y_j$ $(j \in \Psi)$ and $x_i$ $(i \in \Phi)$ are the AreaVector elements for the two
images respectively. $d_{ij}$ (i = 1, 2, 3; j = 1, 2) are the RDs between the ith region of
image 1 and the jth region of image 2. Assume that regions 1 and 3 in image 1 are
similar to regions 1 and 2 in image 2 respectively, which means that $d_{11}$ and
$d_{32}$ are small. Therefore, to minimize the RMD equation, the flows $c_{11}$ and $c_{32}$
should be chosen to be as big as possible; that is, as many pixels as possible are
matched between the two region pairs. Any unmatched pixels will be matched
with other unmatched pixels in the next most similar region. This means that
the RMD will be smaller when the matched regions are of similar size as well;
an image with a big red region and a small blue one, when compared with an
image comprising a small red region and a big blue one, will yield a relatively big
RMD. Also, bigger regions tend to dominate in similarity comparison because
they contribute more flow.
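The red/blue example can be reproduced with a small transportation solver. The sketch below is not the authors' implementation: it swaps in a generic successive-shortest-path min-cost-flow routine written in plain Python, takes region sizes as integer pixel counts, and returns the minimal total work, i.e. the sum of $c_{ij} d_{ij}$, without any normalization. The function name `region_match_work` and the toy distances are hypothetical.

```python
import math


def region_match_work(x, y, d):
    """Minimal total work sum(c_ij * d_ij) for supplier areas x and consumer areas y.

    x, y: integer pixel counts with sum(y) <= sum(x); d[i][j]: region distance.
    Returns (work, flows), where flows[i][j] is the optimal flow c_ij.
    """
    m, n = len(x), len(y)
    if sum(y) > sum(x):
        raise ValueError("the supplier must offer at least as much area as the consumer needs")
    source, sink, size = m + n, m + n + 1, m + n + 2
    graph = [[] for _ in range(size)]  # edge: [target, capacity, cost, index of reverse edge]

    def add_edge(u, v, capacity, cost):
        graph[u].append([v, capacity, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    for i in range(m):
        add_edge(source, i, x[i], 0.0)          # a supplier region cannot give more than x_i
        for j in range(n):
            add_edge(i, m + j, min(x[i], y[j]), d[i][j])
    for j in range(n):
        add_edge(m + j, sink, y[j], 0.0)        # a consumer region must receive exactly y_j

    work, remaining = 0.0, sum(y)
    while remaining > 0:
        # Bellman-Ford search for the cheapest augmenting path in the residual graph.
        dist = [math.inf] * size
        prev = [None] * size
        dist[source] = 0.0
        for _ in range(size - 1):
            changed = False
            for u in range(size):
                if dist[u] == math.inf:
                    continue
                for k, (v, capacity, cost, _rev) in enumerate(graph[u]):
                    if capacity > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v], prev[v], changed = dist[u] + cost, (u, k), True
            if not changed:
                break
        # Find the bottleneck capacity along the path, then push that much flow.
        push, v = remaining, sink
        while v != source:
            u, k = prev[v]
            push = min(push, graph[u][k][1])
            v = u
        v = sink
        while v != source:
            u, k = prev[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        work += push * dist[sink]
        remaining -= push

    flows = [[0] * n for _ in range(m)]
    for i in range(m):
        for target, _capacity, _cost, rev in graph[i]:
            if m <= target < m + n:
                flows[i][target - m] = graph[target][rev][1]  # flow = reverse-edge capacity
    return work, flows


d = [[0.0, 1.0], [1.0, 0.0]]                     # rows/columns: red, blue (toy distances)
print(region_match_work([90, 10], [10, 90], d))  # big red + small blue vs. small red + big blue
print(region_match_work([90, 10], [90, 10], d))  # same layout in both images
```

With these toy numbers the mismatched layout forces most of the big red region to flow into the big blue one, so its work is large, while the identical layout needs no work at all.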

Based on the principles and constraints of the RMD, image 1 is always
the one with more regions. It can be either the query image or a candidate target
image. The selected feature vector of a region in image 1 is compared with
the feature vector of a region in image 2 to determine the region-wise similarity
according to visual features and the region overlap, i.e. the RDs. The overall
image similarity is therefore determined by the RMD, which requires maximal
similarity over all the matched regions. The computation of the RMD involves an
optimization process and hence takes much longer than computing the
RDs.

Fig. 4. The matching of an image pair by regions using RMD



4 Experiments 

Image retrieval is performed using color mean, color histogram, wavelet feature 
as region feature respectively. In order to make a comparison with global content 
representation, the popular global color histogram feature is also used. 

Each image in the image database is first segmented into a number of
homogeneous regions by color. Each region is represented using color and texture
attributes as described in Section 3.2. Thus, three kinds of feature databases, i.e.
the color mean feature, the region color histogram feature, and the wavelet feature, are
constructed. If every image is treated as a single region, the global
feature database can also be generated. As a result, image retrieval using global
and regional features can be compared. The global feature is
defined by the color histogram, known as the global color histogram.

In the experiment, the previously proposed Average Retrieval Rate (ARR) is
used to evaluate the effectiveness and accuracy. ARR is defined as the average
percentage of images retrieved correctly from a particular class, given a sample
image of that class. Let $\omega$ represent the number of classes experimented
on, $n_i$ $(i = 1, \ldots, \omega)$ the number of images retrieved for the sample image
of the ith class, and $m_i$ $(i = 1, \ldots, \omega)$ the number of images retrieved correctly
for that image. Then

$$ ARR = \frac{1}{\omega} \sum_{i=1}^{\omega} \frac{m_i}{n_i} \qquad (15) $$

The retrieval experiments are performed on images from ten classes of
a scaled-down Corel Stock Photos image database (see Table 1). The ten classes
are considered to be the most typical kinds of natural images in the database.
They represent grass and animals, sunset, star and sky, building, sky and plane,
green scene, lake, sea creature and sea floor, flowers and cars respectively. Some
of these images are shown in Figure 5. For each class, 10 images are randomly
selected as the queries of that class, giving a total of 100 test images. The 8 most similar
images ($n_i = 8$) are retrieved for each of them according to the selected features.
The experimental results are tabulated in Table 1. The numbers in the table are the
average numbers of images retrieved correctly for each class.



Table 1. Performance evaluation by ARR

image class         color mean   region histogram   wavelet feature   global histogram
grass & animals     7            4.5                5.5               5
sunset              5            3.5                6                 3.5
stars & sky         6.5          5                  3                 5.5
building            6            4                  6.5               5.5
sky and planes      6.5          4.5                6.5               6
forest scene        7            5.5                5.5               6
lake                4.5          2                  4.5               3.5
underwater world    3            2.5                4.5               2.5
flowers             6            5.5                6                 5.5
cars                5.5          3                  5                 5.5
Overall ARR         71.3%        50%                68.5%             60.6%
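As a small worked example, the overall ARR of the color mean column can be recomputed from the per-class averages in Table 1, taking the ARR as the mean over classes of the ratio of correctly retrieved images to retrieved images (8 per query). The function name is hypothetical.

```python
def average_retrieval_rate(correct, retrieved):
    """Mean over classes of (images retrieved correctly) / (images retrieved)."""
    return sum(m / n for m, n in zip(correct, retrieved)) / len(correct)


# Per-class averages of correctly retrieved images, color mean column of Table 1.
color_mean_correct = [7, 5, 6.5, 6, 6.5, 7, 4.5, 3, 6, 5.5]
arr = average_retrieval_rate(color_mean_correct, [8] * 10)
print(f"{100 * arr:.2f}%")
```

The result, 57/80 = 0.7125, is in line with the 71.3% reported for the color mean feature.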



The results of these experiments show that the region-based approach
produces better performance than the global-histogram-based method when using
the color mean features and wavelet texture features. The region color histogram
seems unsuitable because the images have already been segmented into regions
that are homogeneous in color, so the histogram of a region does not contain
much useful information, while color quantization may introduce additional
errors. The color mean is a good invariant feature for representing
the color information of a region. On the other hand, the wavelet features are
not invariant and hence produce a lower ARR than the color mean.

In our experiments, the images are grouped into classes by human observers 
largely based on image semantics. This explains the overall low ARRs. How- 
ever, image classification itself is an issue. Studies on relevance feedback for the 
training of CBIR system may lead to a better way of classifying image databases. 

5 Conclusions and Future Work 

In this paper, CBIR using regional representation is advocated. Features
extracted from an image region are used to represent that region, and an image is
represented by a feature set of regional feature vectors. Together with the region
location vector and area vector, the content of the image can then be better
described. The Region Match Distance (RMD), which is based on the Earth
Mover’s Distance, is proposed to measure the similarity between the two feature
sets representing two images. The experimental results show that using
regional representation is better than using global representation. They also show
the importance of using invariant features for region-based image retrieval. Future
work will address the feature invariance issues, the similarity metrics, and
the linking of high-level perceptual concepts to low-level features.

Fig. 5. Examples of some of the scaled down experimental images and their
pre-classification (each row forms an image class)

Abstract. In this paper, we have investigated the fusion of surface data obtained by
two different surface recovery methods. In particular, we have fused the depth data 
obtainable by shape from contours and local surface orientation data obtainable by 
photometric stereo. It has been found that the surface obtained by fusing orientation 
and depth data is able to yield more precision when compared with the surfaces 
obtained by either type of data alone. 



1 Introduction 

Current surface recovery methods have their respective advantages and drawbacks. For 
example, while it is possible to obtain accurate measurements using structured lighting, 
the process can be extremely time consuming. Photometric stereo offers fast and dense 
recovery of the local surface orientations, but the depth values that are calculated by 
the integration of recovered normals may be inaccurate with respect to the true
depth values.

To construct a new shape recovery method that can be more robust, efficient and 
versatile than existing methods, we fuse the data obtained by shape recovery methods 
that have complementary characteristics. Based on previous work, we have decided to
construct a new shape recovery method by the fusion of depth and orientation data, which
are respectively obtained by shape from occluding contours and photometric
stereo. These two methods have been chosen on the basis that shape from contours is
able to provide reliable measurements, but unable to recover surface cavities that are 
occluded from the camera by the contours of the object. Conversely, photometric stereo 
provides dense orientation information over the surface, but the depth measurements 
obtained by integration of the surface orientations are relatively scaled to the actual 
depth values. Therefore, the integration of these two methods may be able to yield 
results with higher precision than either one of the methods is able to achieve. 

We have approached the task in the following steps. Firstly, we generate the synthetic 
surfaces and simulate the orientation and depth data as would be obtained by the shape 
recovery methods. From the simulated data, we calculate the weighting functions for
the different types of data. The fusing of orientation and depth data is then performed
according to the weights for the orientation and depth data. Finally, we compare the
surfaces obtained by fusion of data, as well as the surfaces obtained by photometric
stereo and shape from contours, to evaluate the performance of the fusion method.

The block diagram in Fig. 1 shows the steps involved in the integration of photometric
stereo and shape from contours by fusing orientation and depth data.




Fig. 1. Block diagram for the integration of shape recovery methods using fusion of depth and 
orientation data. 



In Fig. 1, z represents the original, known surface function. The data generator
extracts the surface depth values d and orientation components p and q from z. Noise is
added to the extracted depth and orientation values to simulate the effect of obtaining depth
and orientation data using shape from contours and photometric stereo respectively. The
simulated depth and orientation data are put through the concavity hypothesis to
determine the locations where cavities are likely to occur. The weight generator produces
weights for the depth data, $w_d$, and the orientation data, $w_{pq}$, according to the outcome of the concavity
hypothesis. The fusion process combines the depth and orientation data according to $w_d$
and $w_{pq}$ to generate the fused surface $z_{fused}$.

There are two objectives for this work. The first objective is to determine whether 
fusion of the two kinds of data is able to provide more precision than either one of 
these methods. This will be achieved by comparing the recovered surfaces with the true 
surfaces. The second objective is to determine whether there are any observable artifacts 
in the region where different data are fused. The surface recovered by fusion of data will 
be examined to achieve this objective.

2 Generation of Surfaces

We have used MATLAB for the generation of the surfaces. Each of the generated surfaces
has a concave region which is occluded by a neighbouring region on the surface. Such a
surface cavity will not be recovered by shape from contours, yet the surface orientations
within the cavity can be recovered by photometric stereo.

Two types of surfaces were used in this work. The first surface, $z_1$, is generated by
the addition of Gaussian functions, such that the surface is continuous. It is an example
of a general surface. The second surface, $z_2$, is a polyhedral surface generated by
intersecting five planar surfaces. It is an example of a simple surface, such as the surface
of a man-made object. Figure 2 shows the surfaces $z_1$ and $z_2$.

We have used both the continuous and polyhedral surfaces to evaluate the
performance of the fusing method in different cases.



3 Simulation of Orientation and Depth Data 

The surface orientations as obtained by photometric stereo are provided by the partial 
derivatives of the surfaces. 

The surface normals are calculated by approximating the derivatives of the original
surfaces $z_1$ and $z_2$ with respect to the horizontal and vertical directions. For each surface
$z_i$, the local derivatives in the horizontal and vertical directions, respectively indexed by
x and y, are given by

$$ \frac{\Delta z_i(x,y)}{\Delta x} = \frac{z_i(x+1,y) - z_i(x,y)}{\Delta x} , \qquad (1) $$

$$ \frac{\Delta z_i(x,y)}{\Delta y} = \frac{z_i(x,y+1) - z_i(x,y)}{\Delta y} . \qquad (2) $$



For our surfaces, Ax and Ay are both 0.05 units. 

The orientation data are further simulated by adding random noise, N, to the deriva- 
tives. The resultant orientation data in the x and y directions are respectively given by 



$$ p_i(x,y) = \frac{\Delta z_i(x,y)}{\Delta x} + a \cdot N(x,y) , \qquad (3) $$

$$ q_i(x,y) = \frac{\Delta z_i(x,y)}{\Delta y} + a \cdot N(x,y) . \qquad (4) $$

Fig. 2. (a) Surface $z_1$ and (b) surface $z_2$.

The simulated photometric stereo results for surfaces $z_1$ and $z_2$ are shown as needle
maps in Fig. 3.
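The simulation of orientation data can be sketched as below. The text does not specify the distribution of the noise N or whether the two components share one noise sample, so this example assumes independent zero-mean, unit-variance Gaussian samples scaled by a; the surface is a nested list indexed as z[y][x], and the function name is hypothetical.

```python
import random


def simulate_orientation(z, dx=0.05, dy=0.05, a=0.0, seed=0):
    """Forward-difference slopes p, q of z[y][x] with additive noise of amplitude a."""
    rng = random.Random(seed)
    rows, cols = len(z), len(z[0])
    # The last row and column are dropped because they have no forward neighbour.
    p = [[0.0] * (cols - 1) for _ in range(rows - 1)]
    q = [[0.0] * (cols - 1) for _ in range(rows - 1)]
    for y in range(rows - 1):
        for x in range(cols - 1):
            p[y][x] = (z[y][x + 1] - z[y][x]) / dx + a * rng.gauss(0.0, 1.0)
            q[y][x] = (z[y + 1][x] - z[y][x]) / dy + a * rng.gauss(0.0, 1.0)
    return p, q


# Hypothetical test surface: the plane z = 2x + 3y sampled on a 0.05 grid.
plane = [[2 * (x * 0.05) + 3 * (y * 0.05) for x in range(5)] for y in range(4)]
p, q = simulate_orientation(plane)       # noise-free: slopes close to 2 and 3
p_noisy, q_noisy = simulate_orientation(plane, a=0.1)
print(p[0][0], q[0][0])
```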

The amplitude of the original surfaces can be used directly as the depth values 
obtained by shape from contours, except for concave regions on surfaces. The reason 
is that shape from contours is unable to recover cavities that are occluded from the 
viewing direction by other regions of the surface. Therefore, to simulate the depth data 
as obtained by shape from contours, we need to take into account the method’s inability 
to recover surface cavities. 

The surfaces are discretised into a given number of layers. For each layer, a convex
polygon for the surface contour is drawn. The layers are then piled up again and the
convex polygons are joined to form the surface $z_{conv}$. Random noise N is added to
further simulate the depth data obtained by shape from contours, given by

$$ z_d(x,y) = z_{conv}(x,y) + a \cdot N(x,y) . \qquad (5) $$

The obtained surface $z_d$ has a developable patch covering the cavity on the original
surface, as would be expected when shape from contours is used to recover the surface.
The simulated surfaces are given in Fig. 4. Figures 4(a) and (b) respectively show the
depth data for $z_1$ and $z_2$ as might be obtained by shape from contours. From the figures,
it can be seen that the cavities have been covered with developable patches.



4 Detection of Concave Regions 

The compatibility of depth data obtained using shape from contours and orientation data 
obtained using photometric stereo is used to detect the region of cavity on the surface 
to be recovered. The detected concave regions are used to determine the contribution of 
depth and orientation data in the fusion process. 

There are two possible approaches to compare the input data; either by using the 
depth values, or the surface orientations. In this work, the surface orientations have been 
used for the comparison of input data since they are easier to obtain. With the approach, 
partial derivatives are used to provide the surface normals for the surface recovered by 
shape from contours. 

A cross section at y = 0 from surface $z_1$ is shown in Fig. 5. Surface $z_1$ is represented
by the solid line, and the surface normals obtained by photometric stereo are plotted on
$z_1$. The surface recovered by shape from contours is represented by the dotted line, and
the surface normals calculated from the surface recovered by shape from contours are
plotted on the dotted line.

From Fig. 5 it can be seen that the orientations of the surface normals along the left
column (x = 1.25) differ significantly, whereas the orientations of the surface normals
are quite similar along the right column (x = 1.66). The difference of surface normals in
the left column is caused by the occurrence of a cavity. One way of calculating the angle
between the two orientation vectors is to use the dot product of the surface normals.
Therefore, the dot products of the normals can be used to indicate the discrepancies
between these two types of data.

Fig. 4. Simulated shape from contours depth data for surfaces (a) $z_1$ and (b) $z_2$.

The discrepancies caused by noise also need to be taken into account when deter- 
mining the compatibility of the data. In this work, the compatibility function between 
the data is a binary function given by 



$$ c(x,y) = \begin{cases} 1 & n_{psm} \cdot n_{sfc} \geq t , \\ 0 & \text{otherwise} \end{cases} \qquad (6) $$

where t is the thresholding value that has been chosen such that the data discrepancies
caused by noise can be avoided. The vectors $n_{psm}$ and $n_{sfc}$ respectively represent the
surface normals obtained by photometric stereo and shape from contours.
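A minimal sketch of the compatibility test at a single grid position follows. It assumes that each normal is built from slope components as (-p, -q, 1) and normalized to unit length, which is a common convention but is not spelled out in the text; the threshold value 0.9 and the function names are placeholders.

```python
import math


def unit_normal(p, q):
    """Unit surface normal for slopes p = dz/dx and q = dz/dy."""
    norm = math.sqrt(p * p + q * q + 1.0)
    return (-p / norm, -q / norm, 1.0 / norm)


def compatibility(p_psm, q_psm, p_sfc, q_sfc, t=0.9):
    """1 if the photometric-stereo and shape-from-contours normals agree, else 0."""
    n_psm = unit_normal(p_psm, q_psm)
    n_sfc = unit_normal(p_sfc, q_sfc)
    dot = sum(u * v for u, v in zip(n_psm, n_sfc))  # cosine of the angle between normals
    return 1 if dot >= t else 0


print(compatibility(0.1, 0.0, 0.12, 0.0))  # nearly identical slopes
print(compatibility(2.0, 0.0, -2.0, 0.0))  # opposite slopes, as inside a covered cavity
```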




Fig. 5. Cross section and surface normals of original surface and surface obtained by shape from
contours.

In Fig. 6, white pixels represent regions where the data are compatible, and black
pixels represent regions where the data are incompatible. From the figures, it can be seen
that the general shape of the cavities has been detected for both surfaces. However, note
that for some points, the surface orientations within the cavity do agree with the surface
orientations for the developable patch. Noise added to the simulated data has also caused
the data to disagree outside the cavity. Therefore, the compatibility function will need
to be improved further to enable more accurate detection of concave regions.

5 Calculation of Weights for Fusing Data 

In this step, we determine the contribution of the depth and orientation data towards the 
final result based on the computed data compatibility. This is a crucial step in the fusion 
of data, since suitable choice of weights enables the fusion of the complementing data 
to be performed in such a way that the surface given by fusion yields higher accuracy 
than surfaces given by either method alone. 

The depth data obtained by shape from contours are reliable except in concave
regions. Therefore, the data weighting functions are computed based on the data
compatibility function, c(x, y), which indicates the region of cavity. In this work, the weighting
function for the depth data,

$$ w_d(x,y) = \begin{cases} 1 & c(x,y) = 1 \\ 0 & c(x,y) = 0 \end{cases} \qquad (7) $$

is the same as the compatibility function, which is 1 when the data are compatible, and
0 when the data are incompatible. The values are thus defined because the depth data
are not reliable within concave regions, where the depth and orientation data are likely
to be incompatible.

The weighting function for the orientation data,

$$ w_{pq}(x,y) = \begin{cases} 1 & c(x,y) = 0 \\ 0 & c(x,y) = 1 \end{cases} \qquad (8) $$

has values of 0 in regions where the data are compatible and 1 where the data are
incompatible, since the orientation data obtained by photometric stereo are more reliable
than the depth data obtained by shape from contours within surface cavities.

The weighting functions are selected such that the unknown surface will generally
be recovered from the depth values obtained by shape from contours, except for the
concave regions, where the surface will be recovered according to the surface orientations
obtained by photometric stereo.

The weighting functions may also take on values other than 1 or 0 to adjust the
contributions of depth and orientation data towards the fusion process.

6 Fusion of Data



The fusion algorithm is implemented according to the work by D. Terzopoulos.
It reconstructs the unknown continuous function z by combining the depth and
orientation data with respect to certain weighting functions. The input parameters are d,
the depth data, p and q, the horizontal and vertical orientation data, as well as $w_d$ and $w_{pq}$,
the weighting functions for depth and orientation data. In this work, the depth data, d, are
the noise-corrupted depth values obtained using shape from contours. The orientation
data, p and q, are the noise-corrupted orientation data obtained by photometric stereo.

The fusion algorithm minimises the error function

$$ E(z) = E_d(z,d,p,q) + E_m(z) + E_t(z) , \qquad (9) $$

where

$$ E_d(z,d,p,q) = w_d \cdot S_d \sum |z - d|^2 + w_{pq} \cdot S_{pq} \sum \left( \left| \frac{\partial z}{\partial x} - p \right|^2 + \left| \frac{\partial z}{\partial y} - q \right|^2 \right) , \qquad (10) $$

$$ E_m(z) = S_m \iint \left( \left( \frac{\partial z}{\partial x} \right)^2 + \left( \frac{\partial z}{\partial y} \right)^2 \right) dx \, dy , \qquad (11) $$

$$ E_t(z) = \iint \left( \left( \frac{\partial^2 z}{\partial x^2} \right)^2 + 2 \left( \frac{\partial^2 z}{\partial x \partial y} \right)^2 + \left( \frac{\partial^2 z}{\partial y^2} \right)^2 \right) dx \, dy . \qquad (12) $$

In the above formulae, $E_d$ is the data error function, which specifies how well the
unknown surface z conforms to the input depth and orientation data at discrete grid
positions. The second term, $E_m$, represents the membrane function. It specifies that the
surface z should be continuous. The third term, $E_t$, represents the thin-plate function,
which specifies that the surface z should have small changes in its curvature, that is,
the surface should be smooth, with no sudden peaks. The values $S_d$, $S_{pq}$, and $S_m$
are the coefficients of the depth error, orientation error and the continuity constraint,
respectively. These values can be assigned by the designer of the algorithm. In our case,
the coefficients have all been set to 1.

The conjugate gradient descent method is used to iteratively minimise the error
function, such that the result is obtained when the value of the error function is at its
minimum. The error function is convex, so that the algorithm will always converge to
the global minimum.
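The behaviour of the fusion step can be illustrated on a one-dimensional profile. The sketch below is a strong simplification and not the implementation used in this work: it keeps only the data term and the membrane term (the thin-plate term is omitted), uses binary weights $w_d = c$ and $w_{pq} = 1 - c$ with a hand-set cavity range, and minimises the resulting quadratic energy with a plain conjugate gradient loop. All names, the test profile and the grid spacing are assumptions of this example.

```python
import math


def fuse_1d(d, p, w_d, w_pq, h=0.05, s_d=1.0, s_pq=1.0, s_m=1.0, max_iter=1000, tol=1e-18):
    """Minimise sum s_d*w_d*(z-d)^2 + sum s_pq*w_pq*(dz/h - p)^2 + s_m*sum (dz/h)^2 by CG."""
    n = len(d)

    def apply_h(z):
        # Half the gradient of the quadratic part of the energy (a symmetric operator).
        out = [s_d * w_d[i] * z[i] for i in range(n)]
        for i in range(n - 1):
            g = (s_pq * w_pq[i] + s_m) * (z[i + 1] - z[i]) / (h * h)
            out[i + 1] += g
            out[i] -= g
        return out

    # Right-hand side: contributions of the depth data d and the slope data p.
    b = [s_d * w_d[i] * d[i] for i in range(n)]
    for i in range(n - 1):
        g = s_pq * w_pq[i] * p[i] / h
        b[i + 1] += g
        b[i] -= g

    z = list(d)                                   # start from the depth data
    r = [bi - hi for bi, hi in zip(b, apply_h(z))]
    direction = list(r)
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs < tol:
            break
        hd = apply_h(direction)
        alpha = rs / sum(u * v for u, v in zip(direction, hd))
        z = [zi + alpha * di for zi, di in zip(z, direction)]
        r = [ri - alpha * hi for ri, hi in zip(r, hd)]
        rs_new = sum(v * v for v in r)
        direction = [ri + (rs_new / rs) * di for ri, di in zip(r, direction)]
        rs = rs_new
    return z


n, h = 21, 0.05
truth = [1.0 - 0.5 * math.exp(-((i * h - 0.5) / 0.1) ** 2) for i in range(n)]  # profile with a cavity
c = [0 if 6 <= i <= 14 else 1 for i in range(n)]          # 0 marks the (assumed known) cavity
w_d, w_pq = c, [1 - ci for ci in c]                       # binary weights, as in (7) and (8)
d = [truth[i] if c[i] else 1.0 for i in range(n)]         # depth data: cavity covered by a flat patch
p = [(truth[i + 1] - truth[i]) / h for i in range(n - 1)] # noise-free slope data
fused = fuse_1d(d, p, w_d, w_pq, h)
print(truth[10], d[10], fused[10])
```

In this toy setting the fused profile dips inside the cavity, unlike the depth data, but stays shallower than the true profile because the membrane term pulls the slopes towards zero. This mirrors the shallower cavity discussed in the results.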



C.-Y. Chen and R. Sara

Fig. 7. Fused surfaces for (a) $z_1$ and (b) $z_2$.

7 Results

After applying the fusion algorithm to the two different types of data, we obtained the
surfaces shown in Fig. 7.

Figure 7(a) shows the surface obtained by fusion of orientation and depth data for
surface $z_1$. The shape and amplitude of the recovered surface conform well to the original
surface. Also notice that there are no visible artifacts along the cavity boundary in the
resultant surface.

The surface recovered by fusion of data for surface $z_2$ is shown in Fig. 7(b). The
recovered surface also retains the general shape of the original surface, but the amplitude
of the recovered surface differs slightly from the original surface. There are also no
artifacts on the fused surface along the cavity boundary. In fact, the recovered surface
is a continuous surface, whereas the original surface has discontinuous folds along the
cavity boundary and within the concave region. The reason is that the fusion method
assumes that the unknown function z has to be continuous, hence the sharp folds of
the original surface are not recovered.

From Fig. 7(a), it can be seen that the recovered cavity is not as deep as the cavity
on the original surface. The result can be more clearly examined in Figs. 8 and 9.

Figure 8(a) shows cross sections at y = 0 of the true surface $z_1$, the surfaces
recovered by simulating shape from contours and photometric stereo, and the fused surface.
Figure 9(a) shows the same for surface $z_2$. With reference to the figures, it can be seen
that the surfaces recovered by shape from contours, $z_{sfc}$, conform well to the
original surfaces in most regions, except for the cavities, whereas the surfaces recovered
by photometric stereo, $z_{psm}$, conform to the original surfaces quite well in the concave
regions, but suffer from the effect of accumulated errors towards the positive x direction.
The surfaces obtained by fusion, $z_{fused}$, lie between the surfaces recovered by
photometric stereo and shape from contours, and have the combined advantages of both methods.
Like the surface recovered by shape from contours, the surface recovered by fusion
conforms to the original surface quite well in most regions. But unlike $z_{sfc}$, $z_{fused}$ has a
cavity in the concave region, even though the cavity is less pronounced when compared
to the original surface. The insufficient depth of the recovered cavity is a major source
of error in the surface recovered from fusion. The shallower cavity is partly due to the
fact that the weighting functions are binary, thus the contribution from each type of data
is either 0 or 1. The smoothness constraint on the recovered surface is another cause of
the insufficiently recovered cavity.

Figure 8b) shows the sum of absolute errors along the y direction between $z_1$ and
the surfaces acquired using different methods. Figure 9b) shows the errors for surface $z_2$.
From the figures, it can be seen that the errors from $z_{sfc}$ are fairly constant, apart from
the concave regions. While $z_{psm}$ appears to conform to the true surfaces quite well in the
selected cross sections, the larger error magnitudes indicate that it is more erroneous
overall, for both surfaces $z_1$ and $z_2$. The cumulative errors for $z_{psm}$ can also be easily
observed, as the errors increase towards the positive x direction. The errors for $z_{fused}$
are similar to the errors from $z_{sfc}$ for regions without cavities. Therefore, $z_{fused}$ is
generally less erroneous than $z_{psm}$. On the other hand, $z_{fused}$ recovers cavities to a
certain extent, hence the errors in the cavity regions are smaller than those of $z_{sfc}$.

Overall, from Figs. 8 and 9 it can be seen that the surface recovered from fusion of
depth and orientation data gives a better result than the surfaces obtained from either depth
or orientation data alone.

8 Evaluation 

In this section, the recovered surfaces are compared with the ground truth to evaluate 
the accuracy of each method. 

The error function is given by $e = z - \hat{z}$, where $z$ is the surface obtained by fusion
of the original, uncorrupted data, and $\hat{z}$ is the surface recovered from simulated data,
which are corrupted by noise.

For surface $z_1$,

Table 1. Errors between $z_1$ and surfaces recovered using different methods.

(%)          Fused      SFC        PSM
min |e|      0.0005     0.0018     0.0000
max |e|      16.5979    19.3352    7.8308
mean |e|     1.5120     1.5617     2.5173
RMS          1.5630     1.8562     1.7958


For surface $z_2$,

Table 2. Errors between $z_2$ and surfaces recovered using different methods.

(%)          Fused      SFC        PSM
min |e|      1.1752     1.1769     0.0070
max |e|      24.9002    101.5993   15.4209
mean |e|     2.3244     3.5532     5.3134
RMS          1.1657     8.5130     3.6062


It can be seen from the tables that the surfaces recovered by fusing the data are more
accurate than the surfaces recovered by either shape from contours or photometric stereo
alone.
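
The error measures reported in the tables can be illustrated with a short sketch. It assumes that both surfaces are sampled on the same grid and stored as lists of rows; the function name error_stats and the toy values are hypothetical. The tables report percentages, but the normalisation used is not stated here, so the sketch returns raw values only.

```python
import math

def error_stats(z_ref, z_rec):
    """Return min |e|, max |e|, mean |e| and the RMS of e = z_ref - z_rec.

    z_ref and z_rec are lists of rows sampled on the same grid
    (an assumed data layout, chosen only for this illustration).
    """
    errors = [a - b
              for row_ref, row_rec in zip(z_ref, z_rec)
              for a, b in zip(row_ref, row_rec)]
    abs_errors = [abs(e) for e in errors]
    n = len(errors)
    return {
        "min": min(abs_errors),
        "max": max(abs_errors),
        "mean": sum(abs_errors) / n,
        "rms": math.sqrt(sum(e * e for e in errors) / n),
    }

# Toy 2 x 2 surfaces (made-up numbers, not data from the experiment).
z_ref = [[0.0, 1.0], [2.0, 3.0]]
z_rec = [[0.0, 1.5], [1.0, 3.0]]
stats = error_stats(z_ref, z_rec)
```

The same function would be applied once per method (fused, SFC, PSM) to fill one column of a table.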

9 Discussion 

In this work, we have tested the fusion method, as well as photometric stereo and shape 
from contours on two different types of surfaces. The first surface is a continuous 
surface constructed from Gaussian functions. The second surface is a polyhedral surface 
constructed from planar patches. The continuous surface is an example of general 
surfaces, and the polyhedral surface is an example of simple surfaces.

Fig. 8. Cross sections of (a) resultant surfaces and (b) error, for surface $z_1$.

Fig. 9. Cross sections of (a) resultant surfaces and (b) error, for surface $z_2$.

The depth and orientation data are obtained by simulating the results that would be
obtained by shape from contours and photometric stereo. The depth data recovered by
shape from contours are simulated by discretising surface heights at fixed grid positions,
which introduces errors into the simulated depth values. The orientation data recovered
by photometric stereo are simulated by approximating the partial derivatives with the
differences in surface height between adjacent positions at discrete grid positions. Since
this approach provides only approximate partial derivatives, errors are introduced into the
orientation data. Apart from the errors introduced by the discretisation of data, random
noise was also added to the simulated data.
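
A minimal sketch of such a simulation is given below. Several details are assumptions made for illustration only: the test surface (a single Gaussian bump), the grid spacing, the reading of "discretising" as quantisation of heights to a fixed step, the use of forward differences, and Gaussian noise. All function and parameter names are hypothetical.

```python
import math
import random

def surface(x, y):
    """Hypothetical smooth test surface: a single Gaussian bump."""
    return math.exp(-(x * x + y * y))

def simulate(n=5, spacing=0.5, height_step=0.05, noise=0.0, seed=0):
    """Return (true heights, simulated depth, p = dz/dx, q = dz/dy)."""
    rng = random.Random(seed)
    coords = [(i - n // 2) * spacing for i in range(n)]
    z = [[surface(x, y) for x in coords] for y in coords]
    # Depth data: heights quantised to multiples of height_step, plus noise.
    depth = [[round(v / height_step) * height_step + rng.gauss(0.0, noise)
              for v in row] for row in z]
    # Orientation data: forward differences between adjacent grid positions.
    p = [[(z[j][i + 1] - z[j][i]) / spacing + rng.gauss(0.0, noise)
          for i in range(n - 1)] for j in range(n)]
    q = [[(z[j + 1][i] - z[j][i]) / spacing + rng.gauss(0.0, noise)
          for i in range(n)] for j in range(n - 1)]
    return z, depth, p, q

z, depth, p, q = simulate()
```

With noise set to zero, the remaining depth error is bounded by half the quantisation step, which isolates the discretisation error from the added noise.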

Once the orientation and depth data have been obtained, they are used to calculate the 
data compatibility function. The main purpose of the compatibility function is to deter- 
mine the concave region on the surface. In this work, the compatibility of the different 
data is based on the dot product of the surface normals calculated from photometric 
stereo and shape from contours. However, the compatibility function calculated by this 
approach is unable to provide all of the points that lie within the concave region, since 
at some positions within the cavity, the surface normals obtained by photometric stereo 
agree with the normals obtained by shape from contours. Furthermore, positions that do
not lie within the cavity may also be classified as cavity points, simply because of the
introduced noise. One way to improve the detection of cavities is to take the compatibilities
of neighbouring positions into consideration. For example, if the neighbours of
a point are all incompatible, yet the point itself is compatible, then it is highly likely
that the point lies within the cavity. The same approach can be taken to eliminate points
that do not lie within the cavity.
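
The following sketch illustrates both ideas: a binary compatibility map based on the dot product of the two sets of surface normals, and a neighbourhood check that relabels isolated points. The threshold value, the 8-neighbourhood and the function names are assumptions for illustration, not the settings used in this work.

```python
import math

def unit_normal(p, q):
    """Unit normal of a surface z(x, y) with gradient (p, q)."""
    norm = math.sqrt(p * p + q * q + 1.0)
    return (-p / norm, -q / norm, 1.0 / norm)

def compatibility(grad_psm, grad_sfc, threshold=0.95):
    """Binary map: True where the two normals agree (dot product >= threshold)."""
    result = []
    for row_a, row_b in zip(grad_psm, grad_sfc):
        out = []
        for (pa, qa), (pb, qb) in zip(row_a, row_b):
            na, nb = unit_normal(pa, qa), unit_normal(pb, qb)
            out.append(sum(a * b for a, b in zip(na, nb)) >= threshold)
        result.append(out)
    return result

def refine(compat):
    """Flip every point whose existing 8-neighbours all carry the other label."""
    rows, cols = len(compat), len(compat[0])
    refined = [row[:] for row in compat]
    for j in range(rows):
        for i in range(cols):
            neighbours = [compat[b][a]
                          for b in range(max(0, j - 1), min(rows, j + 2))
                          for a in range(max(0, i - 1), min(cols, i + 2))
                          if (a, b) != (i, j)]
            if neighbours and all(n != compat[j][i] for n in neighbours):
                refined[j][i] = not compat[j][i]
    return refined

# A compatible point surrounded by incompatible neighbours is relabelled.
grid = [[False, False, False],
        [False, True, False],
        [False, False, False]]
cleaned = refine(grid)
```

Because refine reads only the original map, the result does not depend on the order in which points are visited.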

Alternatively, the compatibility function can be given as the confidence with which
different types of data agree, taking continuous rather than binary values. In that case,
the confidence can be indicated by the differences between the two types of data, as well as
the compatibility of neighbouring positions.

The weighting functions are calculated according to the data compatibility function. 
The data weighting functions should be constructed such that the unknown surface will 
generally be recovered from the depth data provided by shape from contours, since shape 
from contours provides reliable dimensions of the object. However, the unknown sur- 
face needs to be recovered according to the surface orientation data in cavity regions, 
because photometric stereo is able to recover orientation data within the cavities where 
shape from contours is unable to. In this work, we have used binary weighting functions
for the purpose of evaluation. However, this inflexibility of the weighting values has caused
the recovered cavities to be shallower than the original cavities. One approach to improve
the result is to increase the orientation data weighting function in the concave region.
More generally, the weighting function can be made to vary continuously with
respect to the data compatibility. For example, the depth weighting function may vary
proportionally to the data compatibility, such that the weight increases as the confidence
in data compatibility increases. The orientation data weighting function may vary inversely
with the data compatibility, with higher values in the concave regions to
emphasise the recovered cavities.
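
One possible form of such continuous weights is sketched below. It assumes a compatibility value in [0, 1], reads "inversely" as the linear complement 1 - c, and uses a hypothetical gain factor to emphasise cavities; none of these choices is taken from the experiments reported here.

```python
def fusion_weights(compatibility, cavity_gain=2.0):
    """Return (depth weight, orientation weight) for one grid position.

    compatibility is clamped to [0, 1]; cavity_gain is a hypothetical
    factor that boosts the orientation data where compatibility is low.
    """
    c = min(1.0, max(0.0, compatibility))
    w_depth = c                              # proportional to compatibility
    w_orientation = cavity_gain * (1.0 - c)  # grows as compatibility drops
    return w_depth, w_orientation

# With cavity_gain = 1 and compatibility in {0, 1}, the binary weights
# used for the evaluation are recovered as a special case.
binary_cavity = fusion_weights(0.0, cavity_gain=1.0)
```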

The orientation and depth data, as well as the respective weights, are given to the
fusion algorithm to recover the unknown surface. The fusion algorithm assumes that
the unknown surface is continuous, which might not always be the case, as can be
seen from the smoothing of the folds in the recovered surface for the simple surface $z_2$.
The coefficients of the different constraining error functions have all been set to 1 in
this work. These values may need adjustment to lessen the effect of the continuity or
smoothness constraints.

10 Future Work 

Our next task is to apply the fusion method on data obtained using shape from contours 
and photometric stereo, rather than simulated data. The errors contained in the recovered 
orientation and depth data cause difficulties in the fusion of real data. Therefore, the 
functions involved in data fusion will need to be more robust and resistant to errors. The 
data compatibility function requires improvement to provide reliable cavity detection in 
spite of the noisy data. The weighting functions may need to take the variance of the 
input data into consideration to avoid emphasising erroneous data. The error functions 
used in the fusion algorithm also need to be modified to handle real data. Furthermore, 
if the variances of the errors are known, they can be incorporated into the fusion process 
to compensate for the errors in the fused result. 

Since the fused surface generally conforms with the original surface apart from the 
cavity regions, it may be possible to refine the fused surface by inputting the fused 
surface into the fusion algorithm to be combined with the orientation data. Alternative 
orientation and depth weights can be calculated by comparing the fused surface with the 
orientation and depth data. The process of fusion may be repeated to further refine the 
resultant surface. 

11 Conclusion 

This work is a preliminary step towards the integration of photometric stereo and shape 
from occluding contours. In this work, we have performed fusion on simulated surface 
data and acquired more accurate surface recovery for different types of surfaces. The data 
being fused are the orientation and depth data obtained by simulating the photometric 
stereo method and the shape from contours, respectively. 

Our first objective is to determine whether a surface obtained by fusing orientation and depth
data is more accurate than a surface obtained from either type of data alone. This has been
achieved by quantitatively comparing the surfaces recovered from simulated data with
the surface obtained from the original data. It has been found that the surfaces recovered
by fusion are more accurate than the surfaces recovered by either photometric stereo or
shape from contours.

The second objective is to see if there are any observable artifacts along the cavity
boundary where different types of data are fused. A sharp transition between different types
of data occurs along the cavity boundary, since the surface orientations obtained by
photometric stereo are used to compensate for the inability of shape from
contours to recover cavities. By examining the surface recovered by fusion, it has been
found that there are no observable artifacts along the cavity boundary, where the
transition from one type of data to the other occurs.

We have proposed a method for determining the reliability of the depth and orientation
data respectively obtained by shape from contours and photometric stereo. The
fusion process is performed by combining portions of orientation and depth data that
have been determined to be reliable.

In the experiment, the shape from contours method is able to provide accurate depth 
recovery of the surface except for the concave regions. The maximum error occurs in 
the concave regions since the cavity cannot be recovered from the contours alone. On 
the other hand, the photometric stereo method is able to recover the surface cavities 
with accuracy, but the surface recovered from the orientations alone is more erroneous 
overall, as indicated by the mean errors. 

The results of our simulated experiment are encouraging. From the evaluations, it
has been seen that the surface recovered by fusing depth and orientation data is more
accurate than the surfaces recovered using either depth or orientation data alone.

Finally, we discussed some modifications to the procedures involved in fusing orientation
and depth data, such that the fusion method may be applied to real data. Possible
future work for improving the result obtained by fusion of orientation and depth data has
also been discussed.



Abstract. Automatic gesture recognition systems generally require two
separate processes: a motion sensing process where some motion features
are extracted from the visual input; and a classification process where
the features are recognised as gestures. We have developed the Hand
Motion Understanding (HMU) system that uses the combination of a
3D model-based hand tracker for motion sensing and an adaptive fuzzy
expert system for motion classification. The HMU system understands
static and dynamic hand signs of the Australian Sign Language (Auslan).
This paper presents the hand tracker that extracts 3D hand configuration
data with 21 degrees-of-freedom (DOFs) from a 2D image sequence that
is captured from a single viewpoint, with the aid of a colour-coded glove.
Then the temporal sequence of 3D hand configurations detected by the
tracker is recognised as a sign by an adaptive fuzzy expert system. The
HMU system was evaluated with 22 static and dynamic signs. Before
training the HMU system achieved 91% recognition, and after training
it achieved over 95% recognition.



1 Introduction 

Deaf communities in Australia use a sign language called Auslan. Signers use a 
combination of hand movements, which change in shape and location relative to 
the upper body, and facial expressions. Auslan is different from American Sign 
Language or indeed any other sign language, though it is related to British Sign 
Language. As is the case in other countries, Auslan has rules of context and 
grammar that are separable from the spoken language of the community, in this 
case English. Despite the effort to educate the deaf community to master the 
written form of the spoken language, there is still a vast communication barrier 
between the deaf and aurally unaffected people, the majority of whom do not 
know sign language. 

Thus, there is a need for a communication bridge between Auslan and spoken
English and a means whereby unaffected people can efficiently learn sign language.
An automated communication tool must translate signs into English as
well as translate English into signs. Sign to English translation could be achieved
by using a visual gesture recognition system that must recognise the motion of
the whole upper body, including facial expressions. As an initial step towards
building such a system, we developed a framework for the Hand Motion Understanding
(HMU) system that understands one-handed Auslan signs. The HMU
system uses a combination of 3D tracking of hand motion from visual input and
an adaptive fuzzy expert system to classify the signs. Previously, these techniques
have not been used for gesture recognition and they are presented in this paper.
Automated gesture recognition has been an active area of research in human-computer
interaction applications and sign language translation systems.
Gesture recognition may be performed in two stages: motion sensing, which
extracts useful motion data from the actual motion input; and the classification
process, which classifies the movement data as a sign. Current vision-based
gesture recognition systems extract and classify 2D hand shape information
in order to recognise gestures. The representation of 3D hand postures
or 3D motion by using 2D characteristic descriptors from a single viewpoint
has its inherent limitations. In order to overcome these limitations, Watanabe
and Yachida approximate 3D information by using an eigenspace constructed
from multiple input sequences that are captured from many directions, without
reconstructing 3D structure. The idea of a vision-based sign recognition system
that uses 3D hand configuration data was previously suggested by Dorner.
She developed a general hand tracker that extracts 26 DOFs of a single hand
configuration from the visual input as a first step towards an American Sign
Language (ASL) recognition system. She used a colour-coded glove for easier
extraction of hand features. Rehg and Kanade, on the other hand, developed
a hand tracker that extracts 27 DOFs of a single hand configuration from unadorned
hand images. Both trackers were developed as motion sensing devices,
and have not been tested on the recognition of meaningful gestures. The HMU system
recognises Auslan hand signs by using a 3D model-based hand tracking
technique similar to these previous approaches. While the HMU system employs a similar
tracking technique, our tracker handles occlusion of fingers to some degree, and
uses a simpler hand model with only 21 DOFs. The 3D hand tracker produces
the kinematic configuration changes as motion data, which are similar to data
obtained from Virtual Reality (VR) gloves. In the existing VR-based gesture
recognition systems, the 3D motion data are classified by using either neural
networks or Hidden Markov Models. Sign language signs are very
well-defined gestures, where the motion of each sign is explicitly understood by
both the signer and the viewer. However, the motion of signers varies slightly
due to individual physical constraints and personal interpretation of the signing
motion. We have earlier proposed a classification technique that is capable
of imposing expert knowledge of the input/output behaviour on the system yet
also supports data classification over a range of errors in the motion sensing
process or slight individual hand movement variations. This is achieved by using
an adaptive fuzzy expert system. The HMU system employs this technique to
classify the 3D hand kinematic data extracted from the visual hand tracker as
signs.

2 Overview of the HMU System

The HMU system recognises static and dynamic hand signs by dealing with “fine 
grain” hand motion, such as the kinematic configuration changes of fingers. A 
signer wears a colour-coded glove and performs the sign commencing from a 
specified starting hand posture, then proceeds on to a static or dynamic sign. A 
colour image sequence is captured through a single video camera and used as in- 
put to the HMU system. The recognition is performed by using the combination 
of a 3D tracker and an adaptive fuzzy expert classifier. The recognition process
is illustrated in Figure 1.

The 3D model-based hand tracker extracts a 3D hand configuration sequence 
(each frame containing a set of 21 DOF hand parameters) from the visual input. 
Then, the adaptive fuzzy expert system classifies the 3D hand configuration 
sequence as a sign. This paper presents the techniques used in the hand tracker 
and the classifier, and its performance evaluation for the recognition of 22 static 
and dynamic signs. Our system achieved a recognition rate of over 95%, and 
demonstrated that the tracker computes the motion data with an accuracy that 
is sufficient for effective classification by the fuzzy expert system. Throughout 
this paper, a hand posture refers to a 3D hand configuration. Thus a static sign 
may be recognised by a hand posture only, while a dynamic sign is recognised 
by 3D hand motion that consists of changes of hand postures and 3D locations 
during the course of signing. 



3 The HMU Hand Tracker 

The hand tracker uses a 3D model-based tracking technique, where given a se- 
quence of 2D images and a 3D hand model, the 3D hand configurations captured 
in the images are sequentially recovered by processing a sequence of 2D images. 
The hand tracker consists of three components: 

— The 3D Hand model, which specifies a mapping from 3D hand posture 
space, which characterises all possible spatial configurations of the hand, to 
2D image feature space which represents the hand in an image. 

— The feature measurement that extracts the necessary features from im- 
ages. 

— The state estimation, which makes corrections to the 3D model state in 
order to fit the model state to the 3D posture appearing in the 2D image. 
Throughout the sequence of 2D images, incremental corrections are made to 
the 3D model. 



3.1 Hand Model 

A hand is modelled as a combination of 5 finger mechanisms (totalling 15 DOFs),
each attached to a wrist base of 6 DOFs. The model represents a kinematic chain
that describes the hand configuration, where a model state encodes the hand
posture by using the 21 DOFs, as illustrated in Figure 2.

Fig. 1. Structure of the HMU system. The HMU hand tracker extracts 3D hand configuration
data from the images, then the HMU classifier recognises them as a sign.

Fig. 2. Hand model used in the HMU hand tracker. The base coordinate frame for
each joint and their transformation through rotations and translations are illustrated.

Each of the F1, F2, F3 and F4 finger mechanisms has 3 DOFs, which consist
of 2 DOFs for the MCP (Meta Carpo Phalangeal) joint, and 1 DOF for the
PIP (Proximal Inter Phalangeal) joint. F0 also has 3 DOFs, which consist of 2
DOFs for the CMC (Carpo Meta Carpal) joint, and 1 DOF for the MCP joint.
Thus the model describes the transformation between attached local coordinate
frames for each finger segment, by using the Denavit-Hartenberg (DH) representation,
which is a commonly used representation in the field of robotics.
For example, the transformation matrix for the wrist segment, describing the
position and orientation of the wrist frame relative to some world coordinate
frame, may be calculated as follows:

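$$T^{wrist}_{origin} = \mathrm{Trans}(\gamma_0, 0, 0) \cdot \mathrm{Trans}(0, \gamma_1, 0) \cdot \mathrm{Trans}(0, 0, \gamma_2) \cdot \mathrm{Rot}(z, \gamma_3) \cdot \mathrm{Rot}(y, \gamma_4) \cdot \mathrm{Rot}(x, \gamma_5).$$

For the other segments of F1, and similarly for F2, F3 and F4, we have

$$T^{MCP}_{wrist} = \mathrm{Trans}(x_1, 0, 0) \cdot \mathrm{Trans}(0, y_1, 0) \cdot \mathrm{Rot}(z, \alpha_1) \cdot \mathrm{Rot}(y, \alpha_2),$$

$$T^{PIP}_{MCP} = \mathrm{Trans}(dx_3, 0, 0) \cdot \mathrm{Rot}(y, \alpha_3), \quad \text{and}$$

$$T^{TIP}_{PIP} = \mathrm{Trans}(dx_4, 0, 0).$$

For the segments of F0, we have

$$T^{CMC}_{wrist} = \mathrm{Rot}(z, \tau_1) \cdot \mathrm{Rot}(y, \tau_2),$$

$$T^{MCP}_{CMC} = \mathrm{Trans}(dx_1, 0, 0) \cdot \mathrm{Rot}(z, \tau_3), \quad \text{and}$$

$$T^{TIP}_{MCP} = \mathrm{Trans}(dx_2, 0, 0).$$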

Thus, the hand state encodes the position and orientation of the palm (three rotation and
three translation parameters) and the joint angles of the fingers (three rotation
parameters for each finger and the thumb).
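
The chain of transformations for one finger can be sketched with 4 x 4 homogeneous matrices. The sketch composes the three finger transformations of F1 in the order given above and reads off the fingertip position in the wrist frame; the segment lengths and offsets used in the example are made-up values, and the helper names are hypothetical.

```python
import math

def identity():
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

def trans(x, y, z):
    """Homogeneous translation matrix."""
    m = identity()
    m[0][3], m[1][3], m[2][3] = x, y, z
    return m

def rot(axis, angle):
    """Homogeneous rotation about the x, y or z axis (right-handed)."""
    c, s = math.cos(angle), math.sin(angle)
    m = identity()
    if axis == "x":
        m[1][1], m[1][2], m[2][1], m[2][2] = c, -s, s, c
    elif axis == "y":
        m[0][0], m[0][2], m[2][0], m[2][2] = c, s, -s, c
    elif axis == "z":
        m[0][0], m[0][1], m[1][0], m[1][1] = c, -s, s, c
    else:
        raise ValueError(axis)
    return m

def chain(*matrices):
    """Multiply the matrices from left to right."""
    result = identity()
    for m in matrices:
        result = [[sum(result[i][k] * m[k][j] for k in range(4))
                   for j in range(4)] for i in range(4)]
    return result

def finger_tip(x1, y1, dx3, dx4, a1, a2, a3):
    """Fingertip position in the wrist frame for a 3-DOF finger."""
    t = chain(trans(x1, 0, 0), trans(0, y1, 0), rot("z", a1), rot("y", a2),
              trans(dx3, 0, 0), rot("y", a3),
              trans(dx4, 0, 0))
    return (t[0][3], t[1][3], t[2][3])

# Straight finger: the tip lies at (x1 + dx3 + dx4, y1, 0).
straight = finger_tip(1.0, 0.5, 2.0, 1.5, 0.0, 0.0, 0.0)
```

Projecting such 3D joint positions onto the image plane gives the projected features that the state estimation compares with the measured markers.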



4 Feature Measurement 

In order to locate the joint positions of the hand in images, a well-fitted cotton
glove is used and the joint markers are drawn with fabric paints, as shown in
Figure 3. The feature measurement process performs rapid marker detection by
colour segmentation and determines the correspondence between the marker
locations and the joints in the model.




Fig. 3. Colour coded glove. Ring-shaped markers are applied at the wrist (in green),
the PIP and TIP joints of four fingers (fluorescent orange for F1, green for F2, violet
for F3, and magenta for F4), and the MCP joint and TIP of the thumb (in blue).
Semi-ring-shaped markers are used for the MCP joints of fingers (in yellow).



However, imposter or missing markers arise when marker areas are split or
disappear entirely, and are usually caused by finger occlusions. The tracker computes
the marker size, and if a sudden change in size (or a disappearance) occurs
from the previous frame, it assumes the marker is partially occluded and regards
it as a missing marker. The HMU tracker deals with the missing marker problem
by predicting the location of the missing marker. This is achieved by using the
changes of the 3D model state estimates of the 5 previous frames in order to
predict the 3D model state (for all parameters of the model) that may appear in
the image, and generating the predicted joint positions by projecting this state
onto the image. Kalman filtering is used for the prediction. Figure 4 illustrates
the prediction process.
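
The two steps can be sketched as follows. The sketch is a simplified stand-in, not the Kalman filter used in the tracker: the state is extrapolated by the mean change over the last five estimates, and the size-change threshold max_change is a hypothetical value. Projection of the predicted state onto the image is omitted.

```python
def is_missing(prev_area, area, max_change=0.5):
    """Treat a marker as missing if it vanished or its size changed abruptly.

    max_change is a hypothetical relative threshold, not a value from the paper.
    """
    if area == 0:
        return True
    return abs(area - prev_area) > max_change * prev_area

def predict_state(history, window=5):
    """Extrapolate the next model state from the last `window` estimates.

    Uses the mean per-frame change over the window (constant-velocity
    assumption) in place of the Kalman prediction.
    """
    recent = history[-window:]
    if len(recent) < 2:
        return list(recent[-1])
    steps = len(recent) - 1
    return [last + (last - first) / steps
            for first, last in zip(recent[0], recent[-1])]

# Two model parameters changing linearly over five frames.
history = [[0.0, 10.0], [1.0, 8.0], [2.0, 6.0], [3.0, 4.0], [4.0, 2.0]]
predicted = predict_state(history)
```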



Fig. 4. Prediction of the missing marker. The tracker uses a limited case of Kalman
filtering to predict the state estimate based on the previous estimates, which is then
projected onto an image in order to find a predicted location of the missing marker.



4.1 The State Estimation 

Given a 2D image and the 3D initial model state (the current estimate in our
system), the state estimation calculates the 3D parameter corrections that need
to be applied to the model state to fit the posture appearing in the image. The
state estimation process is shown in Figure 5.

The parameter corrections are calculated by minimising the Euclidean distances
between the image features (that are extracted by the feature measurement
process), and the projected features of the predicted model state (that are
calculated from the model projection process). State estimation employs Lowe’s
object tracking algorithm, which uses a Newton-style minimisation approach
where the corrections are calculated through iterative steps, and in each step the
model moves closer to the posture that is captured in the image.

Parameters. We define the parameter vector to be

$$\vec{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_n)^T,$$

where n is the total number of parameters. The wrist model consists of 6 parameters
(that is, the x, y, and z translation parameters and the 3 rotation
parameters for the wrist). A finger uses 3 rotation parameters, as previously
shown in Figure 2.

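Fig. 5. One cycle of state estimation. All model parameters are corrected iteratively
in order to fit the model to the posture appearing in the image.

Projected Features. The projection of the ith joint onto an image is a function
of the hand state $\vec{\alpha}$, and is given by

$$\vec{p}_i(\vec{\alpha}) = \begin{pmatrix} p_{ix}(\vec{\alpha}) \\ p_{iy}(\vec{\alpha}) \end{pmatrix}.$$

For the whole hand, these vectors are concatenated into a single vector, and for
convenience we define $q_1(\vec{\alpha}) = p_{1x}(\vec{\alpha})$, $q_2(\vec{\alpha}) = p_{1y}(\vec{\alpha})$, etc. Thus,

$$\vec{q}(\vec{\alpha}) = \begin{pmatrix} p_{1x}(\vec{\alpha}) \\ p_{1y}(\vec{\alpha}) \\ \vdots \\ p_{kx}(\vec{\alpha}) \\ p_{ky}(\vec{\alpha}) \end{pmatrix} = \begin{pmatrix} q_1(\vec{\alpha}) \\ q_2(\vec{\alpha}) \\ \vdots \\ q_{m-1}(\vec{\alpha}) \\ q_m(\vec{\alpha}) \end{pmatrix},$$

where k is the total number of joints (thus m = 2k).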

Tracking the palm or a finger requires 3 joints. Palm tracking uses the wrist
and the knuckles of F1 and F4, whereas finger tracking uses the knuckle, PIP
and TIP of the finger.



Error Vector. The measured joint locations are the joint positions, which are
obtained by the feature extraction process from an image. As with the projected
joints, the measured feature locations are concatenated into a single vector, $\vec{g}$,
and then the error vector describing the difference between the projected and
measured joint positions is defined by

$$\vec{e} = \vec{q}(\vec{\alpha}) - \vec{g}.$$

Implementation of Lowe’s Algorithm. We compute a vector of corrections
$\vec{c}$ to be subtracted from the current estimate for $\vec{\alpha}$, the model parameter vector,
on each iteration. This correction vector is computed using Newton’s method as
follows:

$$\vec{c} = \vec{\alpha}^{(i)} - \vec{\alpha}^{(i+1)}.$$

Using Lowe’s algorithm, the tracker solves the following normal equation to
obtain the correction vector $\vec{c}$:

$$(J^T J + \lambda W^T W)\,\vec{c} = J^T \vec{e} + \lambda W^T W \vec{s},$$



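where J is the Jacobian matrix of $\vec{q}$, defined by

$$J = \frac{d\vec{q}(\vec{\alpha})}{d\vec{\alpha}} = \begin{pmatrix} \frac{\partial q_1(\vec{\alpha})}{\partial \alpha_1} & \cdots & \frac{\partial q_1(\vec{\alpha})}{\partial \alpha_n} \\ \vdots & & \vdots \\ \frac{\partial q_m(\vec{\alpha})}{\partial \alpha_1} & \cdots & \frac{\partial q_m(\vec{\alpha})}{\partial \alpha_n} \end{pmatrix}.$$

The matrix W is a normalised identity matrix whose diagonal elements are
inversely proportional to the standard deviation $\sigma_i$ of the change in parameter
$\alpha_i$ from one frame to the next, that is $W_{ii} = 1/\sigma_i$; $s_i$ is the desired default value
for parameter $\alpha_i$, and $\lambda$ is a scalar weight.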

The above equation is driven to minimise the difference between the measured
error and the sum of all the changes in the error resulting from the parameter
corrections. The stabilisation technique uses the addition of a small constant to
the diagonal elements of $J^T J$ in order to avoid the possibility of $J^T J$ being at or
near a singularity. This is similar to the stabilisation technique often used in
other tracking systems.
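
The stabilised iteration can be illustrated on a toy problem. The sketch below assumes a planar two-joint finger whose PIP and TIP positions serve as the projected features, an analytic Jacobian, default values of zero for the corrections, and a fixed number of iterations instead of a test on the size of the error vector. The link lengths, starting values and all function names are hypothetical; the weight of 64 is the constant reported for the HMU system.

```python
import math

L1, L2 = 40.0, 30.0  # hypothetical link lengths in pixels

def project(alpha):
    """Projected features q(alpha): PIP and TIP positions of a planar finger."""
    t1, t2 = alpha
    px, py = L1 * math.cos(t1), L1 * math.sin(t1)
    return [px, py, px + L2 * math.cos(t1 + t2), py + L2 * math.sin(t1 + t2)]

def jacobian(alpha):
    """Analytic m x n Jacobian with entries dq_i / dalpha_j."""
    t1, t2 = alpha
    s1, c1 = math.sin(t1), math.cos(t1)
    s12, c12 = math.sin(t1 + t2), math.cos(t1 + t2)
    return [[-L1 * s1, 0.0],
            [L1 * c1, 0.0],
            [-L1 * s1 - L2 * s12, -L2 * s12],
            [L1 * c1 + L2 * c12, L2 * c12]]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def lowe_step(alpha, g, sigma, lam=64.0, s=None):
    """One correction: solve (J^T J + lam W^T W) c = J^T e + lam W^T W s."""
    n = len(alpha)
    s = s or [0.0] * n
    jac = jacobian(alpha)
    e = [qi - gi for qi, gi in zip(project(alpha), g)]
    w2 = [1.0 / (sd * sd) for sd in sigma]  # diagonal of W^T W
    lhs = [[sum(row[i] * row[j] for row in jac) + (lam * w2[i] if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    rhs = [sum(row[i] * ei for row, ei in zip(jac, e)) + lam * w2[i] * s[i]
           for i in range(n)]
    c = solve(lhs, rhs)
    return [a - ci for a, ci in zip(alpha, c)]  # corrections are subtracted

def track(alpha, g, sigma, iterations=50):
    for _ in range(iterations):
        alpha = lowe_step(alpha, g, sigma)
    return alpha

target = [0.6, 0.5]
estimate = track([0.4, 0.3], project(target), sigma=[math.pi / 4, math.pi / 4])
```

Because the stabilising term pulls each correction towards its default value, a single step undershoots slightly; the repeated steps remove the remaining error.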

In this algorithm, the standard deviation of parameter changes in consecutive
frames represents the limit on the acceleration of each parameter from
frame to frame. For translation parameters, a limit of up to 50 pixels (within
the image size of 256 x 192) is used as the standard deviation, but for rotational
parameters, values between $\pi/4$ and $\pi/2$, depending on the finger joint,
are used as the standard deviation. The scalar $\lambda$ can be used to increase the weight
of stabilisation whenever divergence occurs, but a constant scalar of 64 is used
in the HMU system to stabilise the system throughout the iterations. For each
frame of the sequence, the correction vector is calculated and the model state is
updated iteratively until the error vector is small, at which point the measured
hand shape is close to the projected hand model. Thus for each frame of the
input sequence, the hand tracker generates a hand state vector that consists of
21 DOFs representing the hand configuration.

5 The HMU Classifier

The HMU classifier recognises the 3D configuration sequence that was extracted
by the hand tracker as a sign. A frame in the 3D hand configuration sequence
will be referred to as a kinematic data set, and an example of the set is denoted
kin_pos.

As previously shown in Figure 2, kin_pos is a vector of 15 finger joint
angles (3 DOFs in the MCP and PIP joints of each of the five fingers). Note that
even though the tracker recovers 21 DOFs of the hand, the 6 parameters of the
wrist translation and orientation are not used for sign classification at this stage
of the development. The adaptive fuzzy expert classifier relies on the sign rules
that use the following knowledge representation.

5.1 Knowledge Representation 

In the HMU system, a sign is represented by a combination of: 

— a starting hand posture; 

— motion information that describes the changes that occurred during the 
movement, such as the number of wiggles in a finger movement; and 

— an ending hand posture. 

The starting and ending hand postures are defined by using Auslan basic
hand postures. The HMU system uses 22 postures that are a subset of
Auslan basic hand postures and their variants. Fuzzy set theory is applied
to all posture and motion variables to provide imprecise and natural descriptions
of the sign. Postures are represented by the following variables.

— Finger digit flex variables are defined for all of F0 (d_F0), F1 (d_F1), F2
(d_F2), F3 (d_F3), and F4 (d_F4). The states of d_F0 may be straight
(st), slightly flexed (sf), or flexed (fx). The states of the other digit flex
variables may be straight (st) or flexed (fx). An example of F0 digit flex
variable states and their default fuzzy membership distributions is shown
in Figure 6.

— Finger knuckle flex variables are defined for F1 (k_F1), F2 (k_F2), F3
(k_F3) and F4 (k_F4), and the states may be straight (st) or flexed (fx).

— Finger spread variables (FS) represent the degree of yaw movement of the
MCP joints of F1, F2, F3 and F4, and the states may be closed or spread.

In the sign knowledge representation, motion is represented by the number
of directional changes (wiggles) in the movement of finger digits, finger knuckles,
and finger spreading. We assume 5 states are possible: no wiggle (nw), very small
wiggle (vsw), small wiggle (sw), medium wiggle (mw), and large wiggle (lw).
Note that a state of a posture or motion variable is defined by a state name,
followed by the variable name. For example, a flexed digit of F1 is fx_d_F1,
and no wiggle motion in an F0 digit flex would be represented as nw_d_F0. An
example sign representation is illustrated in Figure 7.

Fig. 6. F0 digit flex variable states, and their default fuzzy membership distributions.
A triangular distribution function has been used for all fuzzy membership distributions
in our sign representation.

Fig. 7. Graphical description of sign_scissors and its corresponding sign representation.
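
A triangular membership function and the state naming convention can be sketched as follows. The angle ranges of the three F0 digit flex states are invented for illustration (in radians) and do not reproduce the default distributions of Figure 6; the function names are hypothetical, and the sketch assumes left < peak < right for every set.

```python
def triangular(x, left, peak, right):
    """Triangular fuzzy membership: 0 outside (left, right), 1 at the peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical fuzzy sets for the F0 digit flex variable d_F0.
D_F0_SETS = {
    "st": (-0.6, 0.0, 0.6),  # straight
    "sf": (0.2, 0.8, 1.4),   # slightly flexed
    "fx": (1.0, 1.6, 2.2),   # flexed
}

def fuzzify(angle, variable="d_F0", sets=D_F0_SETS):
    """Membership of a joint angle in each state, keyed as state_variable."""
    return {f"{state}_{variable}": triangular(angle, *abc)
            for state, abc in sets.items()}

memberships = fuzzify(0.5)
best_state = max(memberships, key=memberships.get)
```

Overlapping sets allow one joint angle to belong partly to two neighbouring states, which is what lets the classifier tolerate small tracker errors.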



5.2 Classification 

The classification is performed in three stages. First, the classifier analyses 
each frame of the hand configuration sequence and recognises the basic 
hand postures. Second, it determines the starting and ending postures as well 
as the motion that occurred in between. Third, a sign is recognised. The 
recognition of both basic hand postures and signs is performed by the fuzzy 
inference engine, which also generates an output confidence, or Rule Activation 
Level (RAL). This is shown in Figure 8. 
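
A minimal sketch of the three stages follows. It is not the HMU rule base: the rule dictionaries are hypothetical, the RAL of a rule is assumed to be the minimum of its antecedent memberships (a common choice for fuzzy AND), and stage 3 is simplified to a lookup on the starting and ending postures, ignoring intermediate postures and wiggle states.

```python
def recognise_posture(frame_memberships, posture_rules):
    """Stage 1: pick the posture rule with the highest RAL for one frame.

    Each rule lists at least one antecedent label such as "fx_d_F1".
    Assumption: RAL = min of the antecedent memberships.
    """
    best_name, best_ral = None, 0.0
    for name, antecedents in posture_rules.items():
        ral = min(frame_memberships.get(a, 0.0) for a in antecedents)
        if ral > best_ral:
            best_name, best_ral = name, ral
    return best_name, best_ral


def posture_sequence(frames, posture_rules):
    """Stages 1-2: per-frame postures collapsed into runs.

    The first and last entries are the starting and ending postures.
    """
    seq = []
    for memberships in frames:
        name, ral = recognise_posture(memberships, posture_rules)
        if name is None:
            continue  # no rule fired for this frame
        if seq and seq[-1][0] == name:
            seq[-1] = (name, max(seq[-1][1], ral))
        else:
            seq.append((name, ral))
    return seq


def count_wiggles(values, tolerance=0.0):
    """Stage 2: number of directional changes in one sampled variable."""
    changes, previous = 0, 0
    for a, b in zip(values, values[1:]):
        step = b - a
        if abs(step) <= tolerance:
            continue
        direction = 1 if step > 0 else -1
        if previous and direction != previous:
            changes += 1
        previous = direction
    return changes


def recognise_sign(seq, sign_rules):
    """Stage 3 (simplified): look up the sign by (start, end) posture."""
    if not seq:
        return None, 0.0
    (start, start_ral), (end, end_ral) = seq[0], seq[-1]
    sign = sign_rules.get((start, end))
    if sign is None:
        return None, 0.0
    return sign, min(start_ral, end_ral)
```

Returning (None, 0.0) when no rule fires mirrors the behaviour reported later, where a failed case produces no output rather than a false one.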

Fig. 8. Sign classification using the fuzzy inference engine. 

5.3 Adaptive Engine 

Fuzzy set theory allows the system to tolerate slight tracker errors or movement 
variations. However, the fuzzy expert system may produce a low decision confidence 
(RAL), or fail, if the input lies near or outside the boundary of the fuzzy 
set. Therefore, we made our fuzzy system adaptive. In the HMU classifier, dynamic 
adjustments to the individual fuzzy distributions are performed under a 
supervised learning paradigm. The adaptive engine modifies fuzzy set regions by 
slightly narrowing or widening the region depending upon whether the system’s 
response was above or below expectation, respectively. As the training data 
are entered, the system classifies them into output signs and their corresponding 
RALs. The fuzzy regions are then modified according to the output. 
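
The narrowing and widening step can be illustrated with a small sketch. The function name adapt_region, the parameters expected_ral and rate, and the choice to scale both sides of the triangle about its peak are assumptions made for illustration; the text states only that a region is slightly narrowed when the response is above expectation and slightly widened when it is below.

```python
def adapt_region(left, peak, right, ral, expected_ral, rate=0.05):
    """Return an adjusted (left, peak, right) triangular fuzzy region.

    Response above expectation: narrow the region slightly.
    Response below expectation: widen the region slightly.
    `expected_ral` and `rate` are hypothetical tuning parameters.
    """
    if ral > expected_ral:
        scale = 1.0 - rate
    elif ral < expected_ral:
        scale = 1.0 + rate
    else:
        scale = 1.0
    new_left = peak - (peak - left) * scale
    new_right = peak + (right - peak) * scale
    return new_left, peak, new_right
```

For example, a region (0, 40, 80) with rate 0.1 widens to roughly (-4, 40, 84) after a low-RAL response and narrows to roughly (4, 40, 76) after a high one, so repeated training examples gradually shift how selective each fuzzy state is.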

6 Sign Recognition 

The input image sequence always starts with a specified posture, posture_flat0, 
which appears in Figure 9. The hand then moves to the starting posture of the 
sign and performs the sign until it reaches the ending posture of the sign. 

6.1 Signs Used in the Evaluation 

The HMU system stores the 22 postures that are illustrated in Figure 9 and 22 
signs that consist of 11 static signs and 11 dynamic signs, as shown in Figure 10. 
Signs consist of actual Auslan signs as well as artificial signs that use various 
combinations of the basic hand postures and motion. 

One signer recorded the image sequences, wearing the colour-coded glove and 
signing under the fluorescent lighting of a normal office environment. 

[Figure 9 illustrates the following 22 postures: posture_flat0, posture_flat1, 
posture_flat2, posture_spread0, posture_ten0, posture_good0, posture_good1, 
posture_point0, posture_hook0, posture_gun0, posture_spoon0, posture_eight0, 
posture_two0, posture_bad0, posture_two1, posture_mother0, posture_three0, 
posture_four0, posture_ambivalent0, posture_animal0, posture_ok0, posture_queer0.] 

Fig. 9. Illustrations of Auslan basic hand postures used in the evaluation. 

For evaluation, 44 motion sequences, consisting of two sub-sequences for 
each of the 22 signs, were recorded by using a single video camera. To enable a 
fair test to be conducted, half of the recorded sequences were used for testing 
and the other half were used for training of the HMU classifier. One sequence 
for each sign was randomly selected, producing a total of 22 sequences as a 
test set. The remaining 22 sequences were used as a training set. 
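
The split described above (one randomly chosen sequence per sign for testing, the rest for training) can be sketched as follows. The function name, the dictionary layout, and the seed parameter are assumptions for illustration; the original selection procedure is described only as random.

```python
import random


def split_sequences(sequences_by_sign, seed=None):
    """Pick one random sequence per sign for testing; the rest go to training.

    `sequences_by_sign` maps a sign name to its list of recorded sequences
    (two per sign in the evaluation described here).
    """
    rng = random.Random(seed)
    train, test = [], []
    for sign in sorted(sequences_by_sign):
        sequences = sequences_by_sign[sign]
        chosen = rng.randrange(len(sequences))
        for index, sequence in enumerate(sequences):
            target = test if index == chosen else train
            target.append((sign, sequence))
    return train, test
```

With 22 signs and two sequences each, this yields 22 test and 22 training sequences, and every sign appears exactly once in each set, so no sign is tested without having been seen in training.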

6.2 Recognition Results 

Prior to training, the system correctly recognised 20 out of the 22 signs. After 
training, for the same test set, the system recognised 21 signs. For all failed cases, 
the system did not produce false output. Figure 11 illustrates the results by 
showing the sign RAL for each of the recognised signs before and after training. 

Given the complexity of extracting and recognising 3D hand configuration 
data from the visual input, the HMU system achieved a high recognition 
rate (91% before training and 95% after). Recognition results of sign_dew are 
shown in Figure 12. The tracker result 
is graphically shown under each image frame, and the posture recognition results 
before (b/t) and after (a/t) training are shown. Note that only every third frame 
is shown, and each posture and sign recognition result is accompanied by a RAL. 
The adaptive engine aims to modify the fuzzy set functions in order to improve 
the system’s behaviour by adjusting the acceptable range of variations in hand 
configuration data when classifying the signs. Thus the training should make 
appropriate adjustments to all fuzzy set regions in order to achieve an improved 
recognition rate and higher RALs, as well as fewer posture outputs for 
each sequence. 

sign            starting posture  intermediate postures  ending posture
point           point0                                   point0
ambivalent      ambivalent0                              ambivalent0
queer           queer0                                   queer0
good            good0                                    good0
gun             gun0                                     gun0
ok              ok0                                      ok0
two             two0                                     two0
four            four0                                    four0
dark            two1                                     two1
hook            hook0                                    hook0
spoon           spoon0                                   spoon0
dew             point0                                   spread0
ten             ten0                                     spread0
good_animal     good0                                    animal0
have            spread0                                  ten0
spread          flat2                                    spread0
fist_bad        ten0                                     bad0
good_spoon      good0                                    spoon0
flicking        ok0                                      spread0
queer_flicking  queer0                                   spread0
scissors        two0              spoon0 two0            spoon0
quote           two0              two1 two0              two1

Fig. 10. Signs used in the evaluation of the HMU system. Note that to perform a sign, 
the hand moves from the specified starting posture to possibly intermediate postures 
until it reaches the ending posture. 

A close observation shows that the tracker produces quite significant errors 
(up to 45 degrees) for either the MCP or the PIP joint flex angles for some 
motion sequences. This caused confusion between two close postures, 
posture_spoon0 and posture_two0, resulting in the failure of sign_scissors after 
training (sign_scissors uses both posture_two0 and posture_spoon0 as subpostures 
during its execution). The overall recognition results, however, demonstrate 
that the HMU tracker generated hand configuration data with an acceptable 
range of errors for training of the system, and that training made the system 
more selective in its recognition of postures. This is shown in the recognition 
of the signs that 
were not recognised before training but were recognised after training, and the 
reduction in the posture outputs by an average of 10.7%. Figure 13 shows a 
recognition result of sign_good_animal, which was not recognised before training, 
but successfully recognised after training. 

[Per-sign results table. Columns: sign; before training (success, RAL, no. of 
pos. outputs); after training (success, RAL, no. of pos. outputs). Most rows are 
not legible in this copy; the legible rows are listed below.] 

              before training                       after training
sign          success  RAL   no. of pos. outputs    success  RAL   no. of pos. outputs
good_animal   -        -     (illegible)            ✓        0.6   36*
good_spoon    -        -     (63)*                  ✓        0.58  56*
scissors      ✓        0.63  205*                   -        -
quote         ✓        0.41  96                     ✓        0.41  79

number of signs                   22
number of sequences per sign      1
total number of test sequences    22

Recognition results                  before training   after training
number of successes                  20                21
success rate (%)                     91                95
av. reduction rate for the posture
outputs after training (%)                             10.7

Fig. 11. Evaluation results. A tick (✓) in the ‘success’ column indicates that the sign is 
recognised correctly, and a dash indicates that no output is produced. An asterisk in 
the ‘no. of pos. outputs’ column indicates a figure that is not included in calculating 
the average reduction rate for the posture outputs after training (only the signs that 
were recognised before and after training are used for the calculation). 
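
The summary figures can be checked with a short calculation. The success rates follow directly from the counts reported above. The averaging function is a sketch that assumes the average reduction rate is the mean of per-sign percentage reductions over signs recognised both before and after training; the text does not spell out the exact formula, and the function names are hypothetical.

```python
def success_rate(successes, total):
    """Percentage of test sequences recognised correctly."""
    return 100.0 * successes / total


def posture_output_reduction(before, after):
    """Percentage reduction in the number of posture outputs for one sign."""
    return 100.0 * (before - after) / before


def average_reduction(rows):
    """Mean reduction over signs recognised both before and after training.

    Each row is (recognised_before, outputs_before, recognised_after,
    outputs_after). Rows failing either test are skipped, which corresponds
    to the asterisked entries in Fig. 11.
    """
    rates = [
        posture_output_reduction(before, after)
        for ok_before, before, ok_after, after in rows
        if ok_before and ok_after
    ]
    return sum(rates) / len(rates)


print(round(success_rate(20, 22)))  # 91
print(round(success_rate(21, 22)))  # 95
```

For sign quote, whose posture outputs fall from 96 to 79, posture_output_reduction gives about 17.7%; the reported 10.7% is the average over all signs recognised in both runs.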



7 Conclusion 



The HMU system successfully recognised various ‘fine-grain’ hand movements by 
using a combination of the 3D hand tracker as a low-level motion sensing device 
and the fuzzy expert system as a high-level motion understanding system. The tracker 
analyses a sequence of images to produce the changes of 21 DOFs of the hand, 
including the orientation and trajectory of the hand (on the wrist base). This 
is achieved by employing a computer vision-based feature extraction technique 
and robotics-based 3D model manipulation. The hand configuration data are 
then classified by the fuzzy expert system, where the sign knowledge is defined 
by high-level, natural language-like descriptions of the hand movement using 
fuzzy logic. 

To build an automated communication tool between deaf and hearing people, 
we are continuing our research. The system needs to understand not only 
fine-grain hand gestures, but also the hand trajectory, facial expression 
and lipreading. Thus the current projects include the development of the following 
systems. The facial expression recognition system recognises emotions using 
the facial muscle movement appearing in the visual input in order to provide 
an additional clue to sign recognition. The lipreading system visually recognises 
the signer’s speech in cases where the signer uses a combination of speech and 
signing, as often occurs in deaf education systems. The lipreading system detects 
the mouth contour and inner mouth appearance to recognise English phonemes. 
The 3D head tracker recognises the 3D head orientation while signing. This is 
useful in facial expression recognition and lipreading systems to deal with the 
2D feature detection while the signer moves the head in 3D. 

Fig. 12. Recognition result of sign_dew before and after training. 

Fig. 13. Recognition result of sign_good_animal before and after training. 